Querying hierarchical data in Snowflake

July 4, 2020

Overview

Having the right tools for the job makes the life of a data engineer a lot easier. It is also more fun to work with a platform that has support for a wide variety of use cases.
More importantly it also has a direct impact on the productivity of data engineers, which is reflected in the overall TCO of a data platform.
In this blog post we look at how we can handle hierarchical data on the Snowflake data platform. Working with hierarchical data is also known as recursion or recursive query.
Unlike any of the other big four cloud data warehouse vendors, Snowflake supports recursive queries. Working with hierarchical data is a common scenario. Not being able to use recursion has a direct impact on the productivity of data engineers and your bottom line.

What is a recursive query?

Let’s look at an example. We have an employee table that contains the hierarchy of our organisation aka the org chart.

On a platform that does not support recursive queries the data engineers need to apply workarounds to drill into the hierarchical structure. They would have to use self-joins (yikes). This will result in poor performance and limited functionality: you will not be able to drill to an arbitrary level of depth in your hierarchy.
Recursion is a feature that is used in almost every coding language that exists. When the data platform offers recursion you can drill into an arbitrary level of depth in a hierarchy with very good performance.
You will also decrease the number of lines of code you need to write and the code is a lot cleaner. In summary you get the following benefits:

  1. Cleaner and more readable code.
  2. Very good performance. At every new iteration it only joins the rows added by the previous iteration. The manual alternative would be to orchestrate to run a series of self-joins in some external loop but would be both a lot more complex and a lot less performant (full scans).
  3. Support for an arbitrary level of depth in the hierarchy.
  4. Better productivity and less overhead and workarounds

Recursive queries on Snowflake

Snowflake offers two options to write recursive queries

WITH CTE

The WITH clause in SQL was created as a clause with “Statement Scoped Views”, which unlike traditional SQL views, only has scope in the query in which it is being used. It precedes a select statement. It is used to define CTE(‘s) inside a select statement.
A CTE or a Common Table Expression is a subquery with an alias or a name inside a select statement. It is analogous to a temporary view but one which only exists in scope of the query that defines it.
CTE Query Syntax:
Standard:

Recursive:

Parts of recursive CTE query:

  1. WITH RECURSIVE recursive_CTE :
    1. The temporary view(CTE) name.
    2. Recursive is an optional term which can be used to distinguish a recursive CTE, as all CTE’s by default are recursive if the query inside the CTE is of recursive nature.
  2. <cte_column_list> :
    1. An optional list of column names.
    2. If it is not added then the query definition inside must have distinct names/aliases defined for the resulting columns.
  3. Anchor Clause :
    1. This part contains the SELECT to show the initial rows selected identifying the top of the hierarchy like the root of a tree.
    2. For example, start at the “President” of the organisation and move down.
  4. Recursive Clause :
    1. This SELECT statement will select the next level of the hierarchy based on the previous layer.
    2. For the first iteration it will select the root output of the anchor clause and for the subsequent iterations the previous layer is the most recent iteration output.
  5. SELECT :
    1. A query expression (i.e. a SELECT statement).
    2. This is a flexible feature in CTE. For example, we can select all or some data from the CTE, select only the last level in the hierarchy, apply a window function or maybe group by specific column.

In the following section we will look at the usage of Snowflake’s CTE feature with a small sample hierarchical employee dataset:

  1. Employee ID
  2. Employee Name
  3. Department
  4. Start Date
  5. Title
  6. Manager ID (Employee ID of the employee to whom the employee reports)

Examples
Query 1: Create a Customer CTE from employee table without recursion and select details from the CTE.

The above query shows how we can create a CTE with the data from the employee table without using recursion and the same can be referenced by a select query. The SELECT query would select relevant information from the temporary view/CTE as if it were a table.
Output:

Query 2: Using a CTE to visually describe the hierarchical nature of the data with a recursive CTE.

Let’s break down the above query:

  1. WITH recursive management_data : Name of the recursive CTE.
  2. <visual_delimtr, employeeid, manager_id, title, Staff_Hierarchy_Level> : The column list.
  3. Anchor Clause : The first select is the anchor clause which says that ‘President’ is the top of the hierarchy.
  4. Recursive Clause : The second select is the recursive clause which builds on the data gathered by the previous iteration.
  5. SELECT : The final select is then used to fetch the complete information from the completed CTE.
  6. The visual delimiter “visual_delimtr” is used to show the separation in the titles to make it clearer as to the tier of management in the organisation.

Output:

Query 3: Using CTE to create a Fibonacci Sequence. CTE can thus be used as a traditional loop function present in any other programming language.

Output:

This is very useful to explode an existing data set, e.g. you can use it to create a date dimension with a single SQL statement.

CONNECT BY

Snowflake’s CONNECT BY is another example of a clause that makes sorting hierarchical data easy. It is a subclause of the FROM clause in a select statement. CONNECT BY is the traditional recursive clause and has been with Oracle SQL for a long time, even before the WITH clause became part of ANSI SQL.
Examples
Query 1: To find the management hierarchy in our employee table using CONNECT BY clause.

Let’s break down the query to understand its parts:

  1. The “starts with” phrase, also known as a predicate is to identify the first level of the hierarchy.
  2. CONNECT BY clause is like a JOIN clause which connects the current level of management (manager_id) with the previous level’s data (employee ID of next senior manager).

Output:

Query 2: Visualise the hierarchical structure of the data in the table for every employee.

In the above query the additional element is the “sys_connect_by_path” function which is used to print the leaf to root connection in the hierarchy.
Output:

Comparison WITH CTE vs CONNECT BY

So what should you use WITH CTE or CONNECT BY

  1. Simplicity of Code
    1. If you are already familiar with the CONNECT BY syntax from Oracle then it will be handy to use in simple scenarios and will be quicker and shorter to use.
    2. CTE falls naturally into SQL set based paradigm where recursive CTE can be analogous to a temporary table that keeps appending the result of subsequent iterations of a query and then at the end of it, you print a result (outerblock). Once you get familiar with it, CONNECT BY would become a stranger.
  2. Scope of Usage – CTE can handle a lot more complex cases and scenarios. It is a lot more flexible than CONNECT BY for example you can use JOINS inside a CTE. Snowflake CONNECT BY does not support some Oracle clauses:
    1. NOCYCLE
    2. CONNECT_BY_ISCYCLE
    3. CONNECT_BY_ISLEAF
  3. Data Modification – When you have to perform aggregates or modification to data assembled by the clause it is easily done in CTE but very difficult and complex to manage within CONNECT BY as it is basically a read only command.
  4. ANSI compliant – CTE is ANSI SQL compliant whereas CONNECT BY is not.

The PRIOR keyword in CONNECT BY let’s us choose the column values from the previous iteration whereas CTE can let us define table or CTE name to let us define which values are to be included in the current iteration and the previous iteration. It gives CTE more usage scope over CONNECT BY.
Overall, when you want to perform recursive operations on maybe one or two different tables and combine them together to perform some aggregations, data modifications etc. CTE is the way to go. Taking into account that CTE is ANSI compliant, it would be the preferred way to go for recursive queries in Snowflake.

Conclusion

As of the date of this post, Snowflake is the only cloud data warehouse platform that supports recursive queries. Recursion is a widely used pattern in programming. Support for recursive queries increases productivity of data engineers and makes queries run more efficiently, which decreases the overall load on the platform and reduces cost. Support for recursive queries has a direct impact on your bottom line.

Enjoyed this post? Have a look at the other posts on our blog.

Contact us for Snowflake professional services.

We created the content in partnership with Snowflake.