Querying hierarchical data in Snowflake

July 4, 2020

Overview

Having the right tools for the job makes the life of a data engineer a lot easier. It is also more fun to work with a platform that has support for a wide variety of use cases.
More importantly it also has a direct impact on the productivity of data engineers, which is reflected in the overall TCO of a data platform.
In this blog post we look at how to handle hierarchical data on the Snowflake data platform. Hierarchical data is usually queried with recursion, that is, with a recursive query.
At the time of writing, Snowflake is the only one of the big four cloud data warehouse vendors that supports recursive queries. Working with hierarchical data is a common scenario, and not being able to use recursion has a direct impact on the productivity of data engineers and on your bottom line.

What is a recursive query?

Let’s look at an example. We have an employee table that contains the hierarchy of our organisation aka the org chart.

On a platform that does not support recursive queries the data engineers need to apply workarounds to drill into the hierarchical structure. They would have to use self-joins (yikes). This will result in poor performance and limited functionality: you will not be able to drill to an arbitrary level of depth in your hierarchy.
Recursion is a feature that is used in almost every coding language that exists. When the data platform offers recursion you can drill into an arbitrary level of depth in a hierarchy with very good performance.
You will also decrease the number of lines of code you need to write and the code is a lot cleaner. In summary you get the following benefits:

Cleaner and more readable code.
Very good performance. Each iteration only joins the rows added by the previous iteration. The manual alternative is to orchestrate a series of self-joins in an external loop, which is both a lot more complex and a lot less performant (full scans).
Support for an arbitrary level of depth in the hierarchy.
Better productivity, with less overhead and fewer workarounds.
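
To make the self-join workaround concrete, here is a minimal sketch. It assumes an employee table with the columns employeeid, employeename, title and manager_id, as used in the queries later in this post; the aliases lvl1 to lvl3 are made up for illustration.

```sql
-- Workaround without recursion: one self-join per level of the hierarchy.
-- This reaches exactly three levels; a fourth level needs yet another join.
SELECT lvl1.employeename AS level_1
     , lvl2.employeename AS level_2
     , lvl3.employeename AS level_3
FROM employee AS lvl1
LEFT JOIN employee AS lvl2
       ON lvl2.manager_id = lvl1.employeeid
LEFT JOIN employee AS lvl3
       ON lvl3.manager_id = lvl2.employeeid
WHERE lvl1.title = 'President';
```

The depth is hard-coded in the number of joins, which is why this approach cannot follow a hierarchy of arbitrary depth.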

Recursive queries on Snowflake

Snowflake offers two options for writing recursive queries: the WITH clause (CTE) and CONNECT BY.

WITH CTE

The WITH clause in SQL defines “statement scoped views” which, unlike traditional SQL views, only exist within the query that uses them. It precedes a SELECT statement and is used to define one or more CTEs for that statement.
A CTE, or Common Table Expression, is a named subquery inside a SELECT statement. It is analogous to a temporary view, but one that only exists in the scope of the query that defines it.
CTE Query Syntax:
Standard:

Recursive:

Parts of recursive CTE query:

WITH RECURSIVE recursive_CTE :
1. The name of the temporary view (CTE).
2. RECURSIVE is an optional keyword that marks the CTE as recursive. A CTE is treated as recursive whenever the query inside it refers to the CTE itself, with or without the keyword.
<cte_column_list> :
1. An optional list of column names.
2. If it is omitted, the query inside the CTE must define distinct names or aliases for its result columns.
Anchor Clause :
1. This SELECT returns the initial rows, which identify the top of the hierarchy, like the root of a tree.
2. For example, start at the “President” of the organisation and move down.
Recursive Clause :
1. This SELECT statement selects the next level of the hierarchy based on the previous level.
2. In the first iteration the previous level is the output of the anchor clause; in each subsequent iteration it is the output of the most recent iteration.
SELECT :
1. A query expression (i.e. a SELECT statement) that reads from the CTE.
2. This is a flexible part of the CTE. For example, we can select all or some of the data from the CTE, select only the last level in the hierarchy, apply a window function or group by a specific column.
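
The parts listed above can be seen together in a very small example. This is only a sketch: the CTE name number_series and the column n are made up for illustration and do not refer to any real table.

```sql
WITH RECURSIVE number_series (n) AS (   -- CTE name and optional column list
    SELECT 1 AS n                       -- anchor clause: the starting row
    UNION ALL
    SELECT n + 1                        -- recursive clause: builds on the previous iteration
    FROM number_series
    WHERE n < 5                         -- stop condition, so that the recursion ends
)
SELECT n                                -- final SELECT over the completed CTE
FROM number_series;
```

The anchor clause produces the value 1, and each iteration adds one more row until the WHERE condition no longer holds, so the CTE contains the values 1 to 5.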

In the following section we will look at the usage of Snowflake’s CTE feature with a small sample hierarchical employee dataset:

Employee ID
Employee Name
Department
Start Date
Title
Manager ID (Employee ID of the employee to whom the employee reports)
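
To follow along, a table of this shape could be created as sketched below. The column names match the ones used in the queries in this post, but the data types and every row of sample data are assumptions made up for illustration; they are not the data set behind the original outputs.

```sql
-- Hypothetical sample table; names and rows are illustrative only.
CREATE OR REPLACE TEMPORARY TABLE employee (
    employeeid   INTEGER,
    employeename VARCHAR,
    department   VARCHAR,
    startdate    DATE,
    title        VARCHAR,
    manager_id   INTEGER   -- employeeid of the manager; NULL at the top of the hierarchy
);

INSERT INTO employee (employeeid, employeename, department, startdate, title, manager_id)
VALUES
    (1,  'Alice', 'Executive',   '2015-01-05', 'President',      NULL),
    (2,  'Bob',   'Engineering', '2016-03-14', 'Vice President', 1),
    (3,  'Carol', 'Sales',       '2016-06-01', 'Vice President', 1),
    (4,  'Dan',   'Engineering', '2017-02-20', 'Manager',        2),
    (5,  'Erin',  'Sales',       '2017-09-11', 'Manager',        3),
    (6,  'Frank', 'Engineering', '2018-04-02', 'Engineer',       4),
    (7,  'Grace', 'Engineering', '2018-10-15', 'Engineer',       4),
    (8,  'Heidi', 'Sales',       '2019-01-07', 'Sales Rep',      5),
    (9,  'Ivan',  'Sales',       '2019-05-20', 'Sales Rep',      5),
    (10, 'Judy',  'Engineering', '2020-02-03', 'Engineer',       4);
```

A temporary table is used here so that the sample disappears at the end of the session.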

Examples
Query 1: Create a Customer CTE from employee table without recursion and select details from the CTE.

WITH customerCTE (CustomerID, CustomerName, Date_of_Joining) AS (
    SELECT employeeID
          ,employeeName
          ,startdate
    FROM employee
    WHERE employeeID > 7
)
SELECT *
FROM customerCTE
WHERE customerID > 8
ORDER BY Date_of_Joining DESC;

The above query shows how to create a CTE from the employee table without recursion and reference it in a SELECT query. The SELECT reads from the temporary view/CTE as if it were a table.
Output:

Query 2: Using a CTE to visually describe the hierarchical nature of the data with a recursive CTE.

WITH management_data (visual_delimtr, employeeid, manager_id, title, staff_hierarchy_level) AS (
    SELECT ' ' AS visual_delimtr
          ,employeeid
          ,manager_id
          ,title
          ,1 AS staff_hierarchy_level
    FROM employee
    WHERE title = 'President'
    UNION ALL
    SELECT visual_delimtr || '--> '
          ,employee.employeeid
          ,employee.manager_id
          ,employee.title
          ,staff_hierarchy_level + 1
    FROM employee
    JOIN management_data
      ON employee.manager_id = management_data.employeeid
)
SELECT visual_delimtr || title AS employee_title
      ,employeeid
      ,manager_id
      ,title
      ,staff_hierarchy_level
FROM management_data;

Let’s break down the above query:

WITH management_data : Name of the recursive CTE. The optional RECURSIVE keyword is omitted here.
<visual_delimtr, employeeid, manager_id, title, staff_hierarchy_level> : The column list.
Anchor Clause : The first SELECT is the anchor clause, which says that ‘President’ is the top of the hierarchy.
Recursive Clause : The second SELECT is the recursive clause, which builds on the rows gathered by the previous iteration.
SELECT : The final SELECT fetches the complete information from the completed CTE.
The visual delimiter “visual_delimtr” indents each title so that the tier of management in the organisation is easier to see.

Output:

Query 3: Using a CTE to create a Fibonacci sequence. A recursive CTE can thus be used like the loop constructs found in other programming languages.

WITH RECURSIVE FibonacciNumbers (rec_level, fib_number, next_number) AS (
    SELECT 0 AS rec_level
          ,1 AS fib_number
          ,1 AS next_number
    UNION ALL
    SELECT rec_level + 1 AS rec_level
          ,next_number AS fib_number
          ,fib_number + next_number AS next_number
    FROM FibonacciNumbers
    WHERE rec_level < 10
)
SELECT rec_level AS Fibonacci_Ordinal_Level
      ,fib_number AS Fibonacci_Sequence
FROM FibonacciNumbers;

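Output: the query returns eleven rows, for ordinal levels 0 through 10, with the Fibonacci values 1, 1, 2, 3, 5, 8, 13, 21, 34, 55 and 89.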

This is very useful for generating rows or exploding an existing data set, e.g. you can use it to create a date dimension with a single SQL statement.
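
As a minimal sketch of the date dimension idea, the query below generates one row per day for January 2020. The CTE name date_dim, the column names and the date range are assumptions chosen for illustration.

```sql
WITH RECURSIVE date_dim (calendar_date) AS (
    -- Anchor clause: the first day of the range
    SELECT TO_DATE('2020-01-01') AS calendar_date
    UNION ALL
    -- Recursive clause: add one day to the previous row
    SELECT DATEADD(day, 1, calendar_date)
    FROM date_dim
    WHERE calendar_date < TO_DATE('2020-01-31')
)
SELECT calendar_date
     , YEAR(calendar_date)    AS calendar_year
     , MONTH(calendar_date)   AS calendar_month
     , DAYNAME(calendar_date) AS day_name
FROM date_dim
ORDER BY calendar_date;
```

The same pattern works for other ranges by changing the two dates; further attributes such as quarter or week number can be derived in the final SELECT.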

CONNECT BY

Snowflake’s CONNECT BY is another clause that makes querying hierarchical data easy. It is a subclause of the FROM clause in a SELECT statement. CONNECT BY is the traditional recursive clause and has been part of Oracle SQL for a long time, even before the recursive WITH clause became part of ANSI SQL.
Examples
Query 1: To find the management hierarchy in our employee table using CONNECT BY clause.

SELECT employeeid
      ,manager_id
      ,title
FROM employee
    START WITH title = 'President'
    CONNECT BY manager_id = PRIOR employeeid
ORDER BY employeeid;

Let’s break down the query to understand its parts:

The START WITH clause contains a predicate that identifies the first level of the hierarchy.
The CONNECT BY clause works like a JOIN condition: it connects each row’s manager (manager_id) to a row from the previous level (the employeeid of that manager). The PRIOR keyword marks the column taken from the previous level.

Output:

Query 2: Visualise the hierarchical structure of the data in the table for every employee.

SELECT sys_connect_by_path(title, ' -> ') AS Management_Level
      ,employeename
      ,employeeid
      ,manager_id AS reporting_manager_id
      ,title
FROM employee
    START WITH title = 'President'
    CONNECT BY manager_id = PRIOR employeeid
ORDER BY employeeid;

In the above query the additional element is the “sys_connect_by_path” function, which prints the path from the root of the hierarchy down to the current row.
Output:

Comparison WITH CTE vs CONNECT BY

So what should you use: WITH CTE or CONNECT BY?

Simplicity of Code –
1. If you are already familiar with the CONNECT BY syntax from Oracle, it is handy in simple scenarios and quicker and shorter to write.
2. A CTE fits naturally into the set-based paradigm of SQL. A recursive CTE is analogous to a temporary table that keeps appending the result of each iteration of a query; at the end, the outer block returns the result. Once you are familiar with it, you may find you rarely reach for CONNECT BY.
Scope of Usage – CTEs can handle far more complex cases and scenarios. They are a lot more flexible than CONNECT BY; for example, you can use JOINs inside a CTE. Snowflake’s CONNECT BY also does not support some Oracle features:
1. NOCYCLE
2. CONNECT_BY_ISCYCLE
3. CONNECT_BY_ISLEAF
Data Modification – When you have to aggregate or transform the data assembled by the clause, this is easy with a CTE but difficult and complex to manage within CONNECT BY, as it is basically a read-only command.
ANSI compliant – CTEs are ANSI SQL compliant whereas CONNECT BY is not.

The PRIOR keyword in CONNECT BY lets us choose column values from the previous iteration, whereas in a CTE we qualify columns with the table name or the CTE name to state which values come from the current iteration and which from the previous one. This gives CTEs a wider scope of usage than CONNECT BY.
Overall, when you want to perform recursive operations on one or two different tables and combine them to perform aggregations, data modifications etc., a CTE is the way to go. Taking into account that CTEs are ANSI compliant, they are the preferred option for recursive queries in Snowflake.
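
As an illustration of aggregating over a recursive CTE, the sketch below counts the employees on each level of the hierarchy. It assumes the employee table and column names used in the earlier queries; the alias head_count is made up for illustration.

```sql
WITH RECURSIVE management_data (employeeid, staff_hierarchy_level) AS (
    -- Anchor clause: the top of the hierarchy is level 1
    SELECT employeeid
         , 1 AS staff_hierarchy_level
    FROM employee
    WHERE title = 'President'
    UNION ALL
    -- Recursive clause: direct reports sit one level below their manager
    SELECT employee.employeeid
         , management_data.staff_hierarchy_level + 1
    FROM employee
    JOIN management_data
      ON employee.manager_id = management_data.employeeid
)
-- The final SELECT treats the completed CTE like any other table
SELECT staff_hierarchy_level
     , COUNT(*) AS head_count
FROM management_data
GROUP BY staff_hierarchy_level
ORDER BY staff_hierarchy_level;
```

Because the result of the recursion is an ordinary row set, the GROUP BY in the final SELECT needs no special handling.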

Conclusion

As of the date of this post, Snowflake is the only cloud data warehouse platform that supports recursive queries. Recursion is a widely used pattern in programming. Support for recursive queries increases productivity of data engineers and makes queries run more efficiently, which decreases the overall load on the platform and reduces cost. Support for recursive queries has a direct impact on your bottom line.

Enjoyed this post? Have a look at the other posts on our blog.

We created the content in partnership with Snowflake.
