SQL,

Mastering SQL: How to detect and avoid 34+ Common SQL Antipatterns

Uli Bethke

Uli has been rocking the data world since 2001. As the Co-founder of Sonra, the data liberation company, he’s on a mission to set data free. Uli doesn’t just talk the talk—he writes the books, leads the communities, and takes the stage as a conference speaker.

Any questions or comments for Uli? Connect with him on LinkedIn.


Published on December 6, 2023
Updated on December 18, 2024

Overview

Are you looking for reasons why your SQL queries are slow or costly. Look no more. This blog post showcases a comprehensive list of 34 SQL antipatterns. Use it as a reference to improve your own SQL queries.

What is an SQL antipattern?

SQL antipatterns can lead to poor performance, maintainability issues, and incorrect results in database applications. They are practices or patterns that should be avoided in favor of more efficient and effective SQL query and database design approaches.

Your subscription could not be saved. Please try again.
You're In! Welcome to FastForward Congratulations on successfully subscribing to the FastForward Data Engineering Newsletter! You're now part of a growing community of 15,000+ data engineers who are staying ahead in the ever-evolving world of data.

FlowForward.

All Things Data Engineering
Straight to Your Inbox!

What are the benefits of fixing SQL antipatterns?

Detecting SQL antipatterns is called SQL analysis. SQL analysis inspects SQL code for common mistakes and antipatterns. It is not just a technical exercise. It has a direct impact on the bottom line of a business. It directly contributes to the operational efficiency, cost-effectiveness, and strategic agility.

Here is a list of the main benefits of SQL analysis.

  • Quality Assurance: SQL code analysis helps ensure the quality and consistency of SQL code, which is crucial for reliable database operations.
  • Cost Reduction: By identifying issues like syntax errors, potential performance inefficiencies, or deviations from coding standards, businesses can avoid costly runtime errors and system downtimes.
  • Time Efficiency: Developers can quickly identify and rectify issues before the code goes into production, saving time and resources.
  • Scalability: Ensures that the SQL code is scalable and maintainable, which is important for businesses as they grow and handle more data.
  • Resource Management: SQL code analysis helps in optimizing the use of database resources, which can lead to cost savings, especially in cloud-based or pay-per-use environments.
  • Data Accuracy: It ensures that the SQL code not only runs but also produces the correct and expected results, which is crucial for business decision-making

Not convinced yet?

The next section goes through a case study of a real world example to show you the differences hands on. SQL analysis can have a direct impact on the bottom line.

SQL antipattern case study. Slashing Compute Costs. Preventing Database Meltdown.

JOIN explosion is a phenomenon in databases that can have adverse effects on query performance and system resources. It occurs when multiple tables are joined in a way that results in a much larger number of rows than expected. Here are the most important adverse effects of join explosion:

  • Increased Compute Costs: Join explosion significantly increases the computational load on the database server. When tables are joined, each row in one table is combined with every row in the other table, resulting in a large number of intermediate rows to process.
  • As the number of rows in the result set grows exponentially due to join explosion, query performance can degrade significantly. Queries can become extremely slow and may time out or never complete.
  • Resource Consumption: Join explosion consumes significant system resources, including CPU, memory, and disk space. This can lead to resource contention, slowing down other database operations and causing system-wide performance meltdown.
  • Risk of Data Inaccuracy: Join explosion can lead to incorrect results if not handled properly. Duplicate or incomplete data may be generated in the result set, leading to data quality issues.

We scanned the query history of a client and came across the following SQL code. In this case a small difference can have a huge impact.

The mistake of the SQL developer was to mix up the table alias names resulting in an incorrect join and the dreaded join explosion.

 

Bad SQL*

Good SQL

SQL Code

SELECT *

FROM PATIENT PT

LEFT

JOIN CEXAM CE

ON PT.CLINIC_NM=CE.CLINIC_NM

AND PT.CEXAMID=CE.CEXAMID

LEFT

JOIN CTEST CT

ON PT.CLINIC_NM=CE.CLINIC_NM

AND PT.TESTID=CT.TESTID

SELECT *

FROM PATIENT PT

LEFT

JOIN CEXAM CE

ON PT.CLINIC_NM=CE.CLINIC_NM

AND PT.CEXAMID=CE.CEXAMID

LEFT

JOIN CTEST CT

ON PT.CLINIC_NM=CT.CLINIC_NM

AND PT.TESTID=CT.TESTID

Execution time

30 minutes

5 seconds

Cost savings

99.75%

* shortened from a more complex query with multiple joins.

We created FlowHigh for SQL analysis.
Use FlowHigh SQL Analyser to protect you from bad performance and incorrect query results.

34 of the most common SQL Antipatterns.

Below is a comprehensive list of common antipatterns sorted by severity. FlowHigh has three levels of severity:

  • Level 1 – Warning
  • Level 2 – Caution
  • Level 3 – Notice

The greater the severity level, the more significant the impact on SQL performance and, in some cases, the accuracy of the data output.

Antipatterns that trigger a warning

Avoid ANSI-89 join syntax

Type

Readability

Problem

The ANSI-89 Join syntax of joining tables in the WHERE clause is difficult to read and has been superseded in the ANSI-92 release of the standard by a more readable syntax. In particular, the (+) operator used in Oracle to define Outer Joins can be confusing because its use is not intuitive.

Solution

Only use the ANSI-92 compliant join syntax.

Legitimate use of the anti pattern

None

Avoid count(*) in the outer join

Type

Performance

Correctness

Problem

All rows—including those in the outer join—are counted when the COUNT(*) aggregate function is used in an outer join. The count of the outer rows is typically not something you want to include in your query.

With this query, we want to count the number of orders for each product. If the query uses count(*), it will also count products that don’t have any orders. The result of the query will be wrong and not what was expected.

product_name

product_id

order_id

sofa

1

1

sofa

1

2

chair

2

NULL

The query would return the following result:

product_name

num_orders

sofa

2

chair

1

Solution

You can fix this by using the count(<column_name>) function. For <column_name>, give a column that can’t be NULL from the table with non preserved rows (the table in the outer join). A primary key column is an example.

Legitimate use of the anti pattern

None

Avoid nesting scalar subqueries in the SELECT statement

Type

Performance

Readability

Correctness

Problem

Scalar subqueries have an impact on readability and performance when used in a SELECT statement.

Solution

Most of the time, you can change a scalar subquery in a SELECT statement into a join.

Legitimate use of the anti pattern

None

Avoid filtering attributes from the non-preserved side of an outer join

Type

Correctness

Problem

Trying to filter the non preserved side of an outer join is a bug in the logic of your code. The WHERE clause filters out unknown values. This means that all outer rows will be filtered out irrespectively. In essence you nullify the outer join.

The value ‘x2345’ for the OrderID column in the Orders table filters out all rows from the OrderDetails table, implicitly converting your outer join to an inner join.

Solution

Rewrite your logic to cater for NULLs

An alternative solution would replace NULLs with a dummy value and then check against the dummy value. This solution is less readable than the previous solution

If you only wanted to show customers whose names were in the order table, you should make your outer join an inner join.

Legitimate use of the anti pattern

The only time when putting a condition in the WHERE clause does not turn a LEFT JOIN into an INNER JOIN is when checking for NULL

Beware of filtering for NULL

Type

Correctness

Problem

Many software developers are caught off-guard by the behaviour of NULL in SQL. Unlike in most programming languages, SQL treats NULL as a special value, different from zero, false, or an empty string.

In SQL, a comparison between a NULL value and any other value (including another NULL) using a comparison operator, e.g. =, !=, < will result in a NULL, which is considered as false for the purposes of a where clause (strictly speaking, it’s “not true”, rather than “false”, but the effect is the same).

The reasoning is that a NULL means “unknown”, so the result of any comparison to a NULL is also “unknown”. So you’ll get no hit on rows by coding where my_column = NULL.

Incorrect usage of NULL with comparison operato

Solution

SQL provides the special syntax for testing if a column is NULL, via IS NULL and IS NOT NULL, which is a special condition to test for a NULL (or not a NULL).

Correct usage of NULL with IS NULL operator

Legitimate use of the anti pattern

None

Avoid implicit column references

Type

Readability

Correctness

Problem

Don’t use implicit columns references in queries with multiple table joins.

In the example above it is not clear if the columns col1, col2, col3, col4 belong to tableA or tableB. Readability suffers and you increase the risk of introducing bugs in the code.

Solution

Explicitly reference each column

In this example each column is referenced by its table alias.

Legitimate use of the anti pattern

If you only have one table in your query you can use implicit column references.

Avoid inner join after outer join in multi-join query

Type

Correctness

Problem

If an outer join is followed by an inner join, all rows from the outer join are thrown away if the predicate in the inner join’s ON clause compares an attribute from the non-preserved side of the outer join and an attribute from a third table. Let’s go through an example

In this query, we left outer join the orders table to the customer table and then inner join the order_line table to the orders table. As a result, only preserved rows from both tables were retained in the result set. In other words we implicitly converted the outer join to an inner join.

The preceding query is equivalent to the following query

Solution

Do not use an inner join against a non preserved table. Here is the correct query to preserve the rows of the order table.

Legitimate use of the anti pattern

None

Avoid implicit cross join

Type

Readability

Correctness

Problem

It is possible to generate a cartesian product using CROSS JOINs. However, you should avoid writing implicit CROSS JOINS with the ANSI-89 syntax.

Tables t1 and t2 are cross joined implicitly without specifying the CROSS JOIN clause.

By using the implicit CROSS JOIN syntax you run the risk of creating cartesian products by accident. This can be catastrophic for query performance and might even crash your database.

Solution

Follow the ANSI-92 SQL syntax when writing explicit CROSS JOINs by using the CROSS JOIN syntax.

Legitimate use of the anti pattern

Never use the implicit CROSS JOIN syntax

Use an alias for derived columns

Type

Readability

Problem

Any derived value, including aggregates, should be given a unique name unless the query is completely ad hoc. The query is now easier to read. The alias may be directly referred to in the filter if the database engine supports it.

Solution

We assign SUM(col2*1.23) the alias gross_price. We use the alias gross_price in the HAVING filter.

Note: Using the alias in the HAVING clause is not supported by all databases.

Legitimate use of the anti pattern

None

Beware of NULL in combination with not equal operator (!=, <>)

Type

Correctness

Problem

When filtering with the Not Equal operator (!=, <>) rows with NULL values in the filtered column are not returned.

Let’s go through an example.

Table with sample data

customer_idage
1NULL
218
325

The following query only returns the row with customer_id 3. It does not return the row with cutsomer_id 1. For this row the age is unknown (NULL) and we don’t know what the age is. It may be 18 or it may not be 18.

The query will skip the cars which don’t have their type defined, irrespectively whether that was intentional or not.

Solution

However, if the intention is to also return records where we know the customer age is not 18 (so not 18 or where we don’t know the age of the customer) then we need to rewrite our query:

The ANSI SQL 99 standard also introduced the IS DISTINCT FROM function. However, it is not supported by all database vendors

Legitimate use of anti pattern

The anti pattern is legitimate if we want to exclude records where the age is unknown (NULL) from our result set.

If the column has a NOT NULL constraint then you can possibly ignore the anti pattern. However, in the spirit of a defensive programming style it is good practice to cater for NULLs. The business rule might change and the NOT NULL constraint dropped in the future.

Avoid unused Common Table Expressions (CTE)

Type

Performance

Readability

Problem

One of the Common Table Expressions (CTEs) that the query defines is not used anywhere else in the query.

Example

The CTE sum_students has been defined but is not referenced in other parts of the query. Only avg_students has been referenced in the query

There are two issues:

  • Including an unused CTE makes the query more complex and less readable. It may confuse other SQL developers
  • It may cause a performance degradation with some databases. Some databases materialise CTEs by default. Other databases inline the CTE. When the database materialises the CTE it will set down the resultset physically or in memory. For some databases you can specify if you want to materialise or inline a CTE. Materialising a resultset allows the database to reuse the result multiple times. Some databases even materialise the resultset of unreferenced CTEs. This requires extra work for no good reason.

Solution

Remove CTEs that aren’t being used, or at least comment them out for future use.

Legitimate use of the anti pattern

None.

Avoid the natural join clause

Type

Readability

Correctness

Problem

The SQL natural join is a type of equi-join that implicitly combines tables based on columns with the same name and type. The join predicate arises implicitly by comparing all columns in both tables that have the same column names in the joined tables.

The following issues may arise:

  • You may accidentally join on the wrong columns, e.g. it is common to add a generic column create_date_time to tables. The natural join would include these column in the join which is most likely incorrect
  • It takes the CBO extra time to parse the query to identify the join column candidates
  • It reduces readability by introducing a non-standard syntax

Solution

Use an ANSI-92 compliant join syntax

Legitimate use of the anti pattern

None

Use LIKE instead of REGEX where LIKE is possible

Type

Performance

Problem

The regular expression engine is much more powerful than the LIKE operator when it comes to pattern matching. However, the regular expression engine requires significantly more resources than LIKE. Whenever you use a regular expression in a filter (WHERE clause), you should carefully assess whether a particular regular expression pattern has its LIKE equivalent or if available LIKE functionality is enough to meet your specific filtering requirement.

Solution

We have rewritten the queries using the regular expression with LIKE

Legitimate use of the antipattern

The following regular expression can not be rewritten with LIKE

Avoid using WHERE to filter aggregate columns

Type

Performance

Readability

Problem

The WHERE clause in combination with a subquery can be used instead of the HAVING clause.

This makes the query hard to read and more complex.

Solution

Use the HAVING clause when filtering on aggregates

Legitimate use of the anti pattern

None

Antipatterns that trigger a Caution

Beware of SELECT *

Type

Performance

Readability

Problem

Whenever a developer uses SELECT *, they commonly pick more columns than they need. Without any justification, the database must free up more system resources. The use of this anti-pattern could cause performance problems, reduce concurrency, and increase costs.

The SELECT * anti pattern has the following impact

  • When columns are renamed or dropped, downstream processes may break in your data engineering and BI work loads
  • More database resources need to be used to scan data at the table level. The impact is significantly worse for databases with columnar storage.
  • More data travels across the network
  • Running SELECT * is expensive. Some database providers, like Google BigQuery, charge customers according to the volume of data accessed. It can literally cost you hundreds of dollars to run a query across a TB-sized table with dozens of columns when we only need four or five. While the extra cost for BigQuery is obvious, you will pay comparable compute costs with other database vendors and have fewer resources available for other users.

Solution

  • Fix the anti pattern for small tables that are part of an ETL or BI application.
  • Apply the LIMIT clause to only retrieve a subset of the data for data exploration use cases. The implementation of LIMIT differs by database vendor and you are advised to consult the explain plan.

SQL without LIMIT takes minutes on a 500 GB table

SQL with LIMIT takes milliseconds for the same query

Example SQL was created and tested on Snowflake

  • Even if we know the names of the columns we need to retrieve it is much quicker to write SELECT * instead of spelling out each individual column. An SQL IDE with autocomplete feature can minimise the amount of work.

Legitimate use of the anti pattern

If this anti pattern appears during data exploration or data discovery, you can ignore it for small tables. The extra expense is probably less than the time you must spend on data engineering if you write out the full column names

For data exploration and data discovery use cases you may need to browse all of the columns in a table to get a better understanding of the data.

Avoid using functions in the WHERE clause

Type

Performance

Problem

There are two reasons to avoid using a function on a column inside a WHERE clause.

  • It is CPU intensive because the function gets called for each record in your result set.
  • It prevents the database from applying an index, sort key or cluster key. As a result, the database needs to read more data from the storage layer. This can have a significant impact on query performance.

The following example demonstrates the anti-pattern in SQL Server. A function wraps around the date_of_birth column.

Solution

Rewrite your query to push the function to the other side of the operator.

One alternative to this would be to pre-calculate the function as part of ETL.

In some databases, such as Oracle, you can create indexes based on functions.

Legitimate use of the anti pattern

  • In some situations, you may not be able to avoid using the function in the WHERE clause.
  • Your data volumes are small.
  • The column you filter against does not have an index or cluster key
  • You have created a function based index for the function on the column

Beware of NULL with arithmetic and string operations

Type

Correctness

Problem

Using arithmetic and string operations on NULLable columns could have side effects that you didn’t plan for.

Table with sample data:

customer_idfirst_namemiddle_namelast_nameage
1hanspeterwurstNULL
2edgarNULLuelzenguelz18

We concatenate the first_name, middle_name and last_name to illustrate the anti pattern for string operations.

As the middle_name is unknown for customer_id 2 the concat string operation also results in an unknown value (NULL). We might have expected “edgar uelzenguelz” as the full name for that customer_id

customer_idfull_name
1hans peter wurst
2NULL

Solution

We recommend using a defensive programming style when writing SQL queries. As a rule, you should always assume any column can be NULL at any point. It’s a good idea to provide a default value for that column. This way you make sure that even if your data becomes NULL the query will work.

For the string operation we can use the COALESCE function to achieve our objective

If you want to be very careful, you can use COALESCE for all of the columns in the string concatenation, not just the ones that can be NULL. This will make sure that your code will still work even if the NULLability of a column changes.

Legitimate case

The columns used in the expression are NOT NULLable.

Beware of implicit self-joins in a correlated subquery

Type

Performance

Readability

Problem

Using an implicit self-join in a correlated subquery in the WHERE clause can cause two problems.

  • Most SQL developers find correlated subqueries difficult to read or debug.
  • Using a correlated subquery may cause a performance issue.

An example. The employee table is used both in the outer and inner query

Solution

Using a window function instead of an implicit self-join will improve the query performance for most cases and also make it easier to understand.

If your database does not support the QUALIFY operator use a subquery

Legitimate use of the anti pattern

In some edge situations the implicit self-join can be more performant. The window function is generally more performant under the following conditions

  • You are dealing with a non-equi join, e.g. >, <, >=, <=. For Non-equi joins the CBO can not use a hash join, which may impact performance.
  • The columns involved in the join condition do not contain indexes or cluster keys.
  • In a self join, there are no table partitions on the columns involved in the join.
  • The self join table is large and has not been reduced on one side of the join by filtering or aggregation. In the example above the inline view has been reduced by aggregation. The resulting self-join is less expensive and may be legitimate.

It is a good idea to run a performance test to compare performance of the query with the self join to the performance of the window function.

Use IN / NOT IN for multiple OR operators

Type

Readability

Problem

Chaining multiple OR operators makes SQL code less readable

Solution

Use the IN / NOT IN operator instead of chaining OR operators. It makes your SQL code more readable.

Legitimate use of the anti pattern

None

Avoid ordinal numbers when using ORDER BY or GROUP BY

Type

Readability

Correctness

Problem

You can order or group the result set based on the ordinal positions of columns that appear in the SELECT clause

ORDER BY

GROUP BY

Using the ordinal positions of columns in the ORDER BY / GROUP BY clause is considered an anti pattern

  • Your SQL becomes much harder to read for other people
  • When you modify the SELECT list, you may forget to make the corresponding changes in the ORDER BY / GROUP BY clause, which will result in errors and bugs.

Solution

Always reference columns by their name and not their ordinal position.

ORDER BY

GROUP BY

Legitimate use of the anti pattern

For ad hoc queries where you need to quickly write up some SQL that you don’t share with other developers and then discard, the anti pattern is legitimate.

Split multi join queries into smaller chunks

Type

Performance

Readability

Problem

The human brain can only store a small set of items in short term memory. That’s why we break longer pieces of information into smaller parts. This is commonly referred to as chunking, e.g. when we break up a long phone number into smaller pieces of information to remember it. The same problem exists with SQL when dealing with a large number of table joins. It gets hard for people to read, understand, and debug the code. As a rule of thumb we should not join more than five tables at a time. If possible, even reduce the number of tables you join to 3-4.

You may also get performance issues when joining a large number of tables. Each time you add a new table join, the number of possible explain plans that the CBO needs to look at grows exponentially. The optimizer might get overwhelmed as it does not have enough time to explore all of the possible plans. This may lead to poor explain plans and poor performance.

Solution

As a rule of thumb we should not join more than 5 tables together at a time. For queries with a large number of tables you should break down the query into smaller chunks that are easier to read and debug.

You can do this by using Common Table Expressions (CTE) or inline views. Let’s go through an example query with overloaded joins.

In this query we join seven tables. The query becomes hard to understand and debug. Let’s break it up into smaller chunks.

We have broken the query up into three logical parts

  • Information related to customer
  • Information to sales
  • Information to invoice

Using a CTE we can query each of the components separately for debugging purposes. Let’s look at visual representation of the query

Visual representation of the query

Let’s zoom in on customer CTE

Customer CTE

We can collapse the CTEs

Collapse CTEs

Legitimate use of the anti pattern

None

Use window functions instead of self joins

Type

Performance

Readability

Problem

Self-joins are frequently used to perform inter row calculations to look forward or backward across row boundaries in a table.

Self joins increase the number of records fetched and the number of table scans

Self-joins are hard to read and understand for most SQL developers in particular if they contain non-equi joins

Self-joins can be hard to debug

Here is an example of a self-join using a non-equi join condition (>).

Solution

Most self joins can be replaced by using window functions.

Window functions generally do not suffer from the issues of self-joins

Legitimate use of the anti pattern

In rare cases a self-join can be more performant than a window function. The more of the following conditions your query meets, the more efficiently the window function will operate compared to the self join.

  • Your self join uses a non-equi join, e.g. >, <, >=, <=. For Non-equi joins the CBO can not use a hash join, which is a very expensive operation.
  • The self-join can not make use of indexes or cluster keys on the columns involved in the join condition
  • There are no table partitions on the columns involved in the self join
  • The self join table is large and has not been reduced on one side of the join by filtering or aggregation.

Most of the time, it’s a good idea to run a performance test to compare how well the query with the self-join works to how well the window function works.

Avoid UNION for multiple filter values

Type

Performance

Readability

Problem

Instead of specifying an IN operator to filter on multiple values a UNION (ALL) query is created to scan the table for each filter value

Solution

Use the IN operator to filter on multiple values

Legitimate use of the anti pattern

None

Use UNION ALL instead of UNION

Type

Performance

Problem

UNION is used to return the unique result across two tables.

This is not only unnecessary in most cases, but it could also lead to wrong results. Also, using UNION instead of UNION ALL by accident can have a big impact on performance.

Solution

Use UNION ALL instead of UNION

Legitimate use of the anti pattern

The UNION operator is what you need if you want to get rid of duplicate records in two different data sets.

Use an alias for inline views

Type

Readability

Problem

You should follow SQL best practices and add an alias to inline views

Solution

Use an alias for your inline view

Note: Some databases will even raise an error if you don’t use an alias for an inline view.

Legitimate use of the anti pattern

None

Avoid using functions in the Join clause

Type

Performance

Problem

There are two reasons to avoid using a function on a column inside a Join clause.

  • It is CPU intensive because the function gets called for each record in your result set.
  • It prevents the database from applying an index, sort key or cluster key. As a result, the database needs to read more data from the storage layer. This can have a significant impact on query performance.

In the example below we need to apply a function to convert the order_date column with data type string to a number. The function makes any index on the order_date column unusable.

Solution

The proper solution depends on the exact nature of your scenario. Properly design your data model and make foreign keys the same data type as the parent keys, e.g. in the example above make both columns a date or number. Certain scenarios may require you to precompute the function as part of ETL or a batch job. Some databases, such as Oracle offer index based functions. Last but not least you can rewrite the query so that the function is used on the side of the join with the smaller table. This will allow the index to be used on the side of the bigger table.

Legitimate use of the anti pattern

  • In some situations, you may not be able to avoid using the function in the WHERE clause.
  • Your data volumes are small.
  • You are dealing with an ad-hoc query
  • The column with the function inside the join condition does not have an index or cluster key

Use parentheses when mixing ANDs with ORs

Type

Readability

Problem

There are a few important rules about how operators and conditions precede each other. The most important rule for conditions is that AND comes before OR.

Don’t make complicated expressions that require the reader to know the rules perfectly. Use spaces and parentheses to make the logic clear.

Solution

Use parentheses when mixing ANDs with ORs to improve readability

Please note it’s very different to:

and in the above case the logic can’t be expressed without brackets.

Legitimate use of the anti pattern

None

Avoid or delay ORDER BY in inline views

Type

Performance

Problem

As a rule of thumb you should leave ordering until the very end. Frequently it is not needed at all. Sorting data is generally an expensive operation in databases. It should be reserved for when you really need it.

There is no point in ordering the CTE customer by customerid. It does not

Solution

Avoid ORDER By in subqueries, common table expressions and inline views

If you know that your data will be used by a business intelligence tool or another client then you should leave the ordering to the client tool

Remove the ORDER BY clause

Legitimate use of anti pattern

Window functions require ORDER BY

Some databases have hacks to filter the top n number of records by using pseudo columns, e.g. Oracle

The following query will select the top 1 records from sales table ordered by sales amount. While this query works in principle it is also best avoided and replaced with a window function

WHERE NOT IN without NOT NULL check in subquery

Type

Correctness

The problem

If any row of a subquery using NOT IN returns NULL, the entire NOT IN operator will evaluate to either FALSE or UNKNOWN and no records will be returned. NOT IN will not match any rows if the subquery just returns a single row with NULL.

Let’s look at an example

We have an orders table

and an order_details table with one NULL for order_id

You create a query that returns orders in the orders table without a record in the order_details table

You might expect the following result

but you get back the following result

The query returns no rows at all. NOT IN will not match any rows if the subquery just returns a single row with NULL.

Solution

You can rewrite your query using a defensive programming style to check for NULL in the subquery

This query will give you the correct answer to your question.

Alternatively, you could use NOT EXISTS instead of NOT IN.

Exists cannot return NULL. It’s checking solely for the presence or absence of a row in the subquery and, hence, it can only return true or false. Since it cannot return NULL, there’s no possibility of a single NULL resulting in the entire expression evaluating to UNKNOWN.

While NOT and EXISTS are equivalent, NOT IN and NOT EXISTS have different behaviour when it comes to NULL.

Legitimate use of the antipattern

If you know that the column in question has a NOT NULL constraint you can leave out the extra check for NULL values in your NOT IN subquery. However, in the spirit of defensive programming you should still add it. The constraint may be dropped in the future

JOIN condition linking the compound key to multiple, different tables.

Type

Performance

Correctness

Problem

An ANSI join for a table has been specified, but the conditions specified link it to multiple, different tables simultaneously.

Example

The RentalTransactions (RT alias) table is being joined on a compound key but those keys link to more than one prior table:

Frequently, it’s an easily missed error that can lead to significant performance problems because the database can’t execute a join across several distinct tables simultaneously. While the optimizer might intervene to internally rewrite the query plan as needed, it’s typically recommended to rectify the query where feasible. It’s also prudent to join multi-key columns to the same, single primary table for both readability and consistency reasons.

Solution

Specify a join condition that consistently links the multi-keys to the same, driving table. Among the options, it’s typically the one positioned lower in the hierarchy, both in terms of the data model and the JOIN sequence. Assuming a hierarchical structure like the Snowflake Dimensional Model is in place, it would simply take swapping the key being compared from the top-level table to its equivalent at a finer granularity. To illustrate, consider how the RentalRegion is joined to the RentalDepot using the RegionID columns. In subsequent joins, the RegionID from RentalRegion can be substituted with its equivalent column from RentalDepot therefore making RT being consistently joined to the latter table only.

Legitimate use of the anti pattern

There might be some rather rare cases where a tables could be joined using different, independent keys all together, effectively creating a cycle (eg. triangle when looking at the data model). Here the transactional table can be naturally joined to its dimensions Vehicles and Users but the last two have some extra link (preferences) that potentially closes the cycle:

In the given scenario, it would be arguably justified enough to keep linked to multiple tables as shown. However, for improved readability, it is still advisable to break the loop and shift that less commonly used, additional JOIN condition to the WHERE clause as an additional filter:

Antipatterns that trigger a Notice

Beware of count distinct

Type

Performance

Problem

The Count distinct operation is computationally expensive and decreases database performance. BigQuery defaults to approximating the distinct count, using a HyperLogLog algorithm. But BigQuery also supports exact distinct count in a separate function.

One of two algorithms is typically used for distinct counts.

  • Sorting: Records are sorted, duplicate records then are identified and then removed. The number of target expression values in the remaining set of tuples would then represent a distinct count of the target expression values.
  • Hashing: Records are hashed and then duplicates are removed within each hash partition. Implementing hashing may be less computationally intensive than the sorting approach and may scale better with a sufficiently large set of records. However, it is still computationally expensive

Solution

There are two possible solutions to the problem.

If you don’t require 100% accuracy you can use approximate count distinct aka HyperLogLog. Most modern databases support this functionality.

Here is an example of approximate count distinct in SQL Server

Bitmap-based distinct counts can be performed incrementally, combining them with Snowflake Materialized Views for query rewrite in databases that support such functionality. Snowflake and Oracle are two examples of databases that support bitmap-based distinct counts.

If you do not have an approximate count distinct (HyperLogLog) or bitmap based count distinct as part of your database, you could precompute the distinct counts as part of ETL.

Legitimate use of the anti pattern

For small data volumes you will not see a performance problem when using count distinct.

As the number of distinct values increases, the elapsed time and memory usage of standard count distinct increases significantly (non-linear growth).

The elapsed time and memory usage of approximate count distinct on the other hand remains consistent regardless of the number of distinct values in the data set (linear growth).

Beware of SELECT DISTINCT

Type

Performance

Correctness

Problem

Incorrect use of the DISTINCT operator by SQL developers and analysts is common.

In the vast majority of query scenarios where it is used, using DISTINCT is problematic.

SELECT DISTINCT is used incorrectly to work around the following types of problems:

  • Data quality issues in the underlying data
  • Data quality issues because of a badly designed data model
  • Mistake in the SQL statement, e.g. Join or Filtering
  • Incorrect usage of SQL features

Solution

When you have duplicate values, resist the urge to use DISTINCT.

When you find duplicates in your data, instead of using DISTINCT, you should first make sure you understand the data model and the data itself.

  • Make sure to check that your table joins are correct. Do you need to join on compound keys instead of single key columns?
  • Make sure to apply the correct filters.
  • Make sure that you understand the data model
  • Make sure to check for data quality issues

Legitimate use of the anti pattern

There are times when it makes sense to use DISTINCT. As a general rule, the pattern is correct if your goal is to find the unique values in one or more columns. But if you use DISTINCT to get rid of a large number of duplicates that you didn’t expect, you’ve hit an anti pattern.

Further reading

SQL antipatterns: SELECT DISTINCT

Pseudo join condition results in a CROSS join.

Type

Performance

Correctness

Problem

A join condition exists but it does not link the joining table resulting in a cross join.

Example

Some conditions have been specified following the ON clause. However, none of them are actually joining: the first condition uses a constant (1=1). The second join condition does not reference the joining table PCE.

This results in an implicit cross join and performance issues.

Solution

Specify a join condition that links the joining table to another table defined earlier in the scope.

Legitimate use of the anti pattern

None. Once a table is explicitly joined (JOIN with ON keyword used) at least one of the condition expressions should bind the table being joined with at least one table specified before.

Avoid nested CTEs

Type

Readability

Problem

It is possible to nest CTEs, but that does not mean you should write queries with them. Nesting CTEs can make the SQL query harder to read and understand, especially when multiple levels of nesting are involved.

Solution

Instead of nesting CTEs, list them sequentially in the WITH clause. This makes the query more readable and easier to maintain.

Legitimate use of the anti pattern

None

How can we automatically identify SQL antipatterns?

You now understand the importance of fixing SQL antipatterns. But how do we find SQL antipatterns in SQL code?

We have created FlowHigh, which is an SQL analysis tool for static and dynamic SQL analysis.

We offer three distinct versions of our FlowHigh SQL Analyzer. The primary version is integrated into our online FlowHigh platform. The second version is available as a Python SDK. Lastly, we have a variation designed specifically for Snowflake Native apps.

FlowHigh UI

You can identify antipatterns using the FlowHigh UI in your web browser. In this flavour, you can easily spot query lines that contain antipatterns and manually review them.

Register for FlowHigh to check your queries for bad SQL code.

FlowHigh SDK

We have also created the FlowHigh SDK. This is a python SDK package for programmatic access. One of the many use cases is to automate the detection of bad SQL.

We can detect anti patterns by running this python code:

For this example we will run this script on this query:

Output:

AP_ID

AP_NAME

SEVERITY

PERFORMANCE_FLAG

READABILITY_FLAG

CORRECTNESS_FLAG

AP_13

Avoid inner join after outer join in multi-join query

Warning

N

N

Y

AP_06

Beware of SELECT DISTINCT

Warning

Y

N

Y

FlowHigh Snowflake Native App

FlowHigh SQL Analyser is also available as Snowflake Native App, which enables us to detect antipatterns within Snowflake. It can analyze Snowflake’s query history to identify problematic SQL queries and schedule regular scans, running them every few minutes.

Using this sample code, we can pinpoint anti-patterns and determine the frequency of their occurrence within queries.

To learn more about FlowHigh SQL Analyser Native App you can check our blog where we explain how to prevent surges in your Snowflake compute usage.

Conclusion

Avoiding these and other anti-patterns in SQL can lead to more efficient, reliable, and maintainable database operations. Always aim for clarity, efficiency, and accuracy to ensure optimal query performance.

Uli Bethke

About the author:

Uli Bethke

Co-founder of Sonra

Uli has been rocking the data world since 2001. As the Co-founder of Sonra, the data liberation company, he’s on a mission to set data free. Uli doesn’t just talk the talk—he writes the books, leads the communities, and takes the stage as a conference speaker.

Any questions or comments for Uli? Connect with him on LinkedIn.