Window Function ROWS and RANGE on Redshift and BigQuery

by Uli Bethke

Uli has been rocking the data world since 2001. As the Co-founder of Sonra, the data liberation company, he’s on a mission to set data free. Uli doesn’t just talk the talk—he writes the books, leads the communities, and takes the stage as a conference speaker.

Any questions or comments for Uli? Connect with him on LinkedIn.

Published on August 22, 2017
Updated on December 18, 2024

Frames in window functions let us operate on subsets of a partition by breaking the partition into even smaller sequences of rows. SQL provides a flexible syntax for defining a frame. We described the syntax in the first post on Window functions and demonstrated some basic use cases in the posts on Data Exploration with Window Functions and Advanced Use Cases of Window Functions. So far we have always defined the frame with the ROWS clause, and the frame borders were the first, last or current row of the partition. In this post, we introduce fixed-size frames and the RANGE clause as an alternative to the ROWS clause. Since Redshift does not support the RANGE clause yet, we will demonstrate this feature on Google BigQuery.

Frame defined by ROWS

Whenever we work with temporal data and need to compute a value from other values that lie within a fixed distance of the current one, we choose a fixed-size moving frame. For example, with stock prices or weather data, we only care about the previous few days when comparing against the current exchange rate or temperature. We will demonstrate the fixed-size frame on alerts for mobile internet usage.
We reuse our working dataset from the post on Data Exploration with Window Functions, which contains phone calls and internet data usage, measured in kB, for two users. You can download the dataset here. We will consider only internet usage and filter out the phone calls.

SELECT
    user_id
    ,date_time
    ,data
FROM calls
WHERE data IS NOT NULL
ORDER BY date_time;

Let’s see a sample:

user_id	date_time	data
1	2016-06-22 16:11:30	22
1	2016-06-22 16:12:16	25
1	2016-06-22 16:13:39	2633
1	2016-06-22 21:16:29	337
1	2016-06-22 22:41:59	21
1	2016-06-23 08:38:17	28
1	2016-06-23 09:01:47	70900
2	2016-06-23 17:51:44	9
2	2016-06-23 18:02:56	10
2	2016-06-23 18:06:01	9
2	2016-06-23 18:19:31	887
2	2016-06-24 08:30:21	34
2	2016-06-24 08:31:04	340000
1	2016-06-24 09:06:44	6310

We want to be notified about unusually large data usage, say, every time a usage is larger than the total of the previous five usages. Thus, the scope of interest is the previous five usages in the sequence ordered by date and time. One usage corresponds to one row in our data, so we define a frame of fixed size 5 by means of the ROWS clause.

SELECT
    user_id
    ,date_time
    ,data
    ,data > COALESCE(SUM(data) OVER (PARTITION BY user_id ORDER BY date_time
        ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING), 0) AS is_alert
FROM calls
WHERE data IS NOT NULL
ORDER BY date_time;

We kept the original attributes and added a boolean one, which indicates whether the alert applies. The window function SUM calculates the total, and the ROWS clause sets the frame borders: the frame starts at the fifth row preceding the current one and ends at the previous row (we do not want to include the current row). Furthermore, we have to handle the NULL values that originate from an empty frame (the first row of each user). We want the total to be zero if the frame is empty, which is exactly what the COALESCE function does. The output follows:

user_id	date_time	data	is_alert
1	2016-06-22 16:11:30	22	TRUE
1	2016-06-22 16:12:16	25	TRUE
1	2016-06-22 16:13:39	2633	TRUE
1	2016-06-22 21:16:29	337	FALSE
1	2016-06-22 22:41:59	21	FALSE
1	2016-06-23 08:38:17	28	FALSE
1	2016-06-23 09:01:47	70900	TRUE
2	2016-06-23 17:51:44	9	TRUE
2	2016-06-23 18:02:56	10	TRUE
2	2016-06-23 18:06:01	9	FALSE
2	2016-06-23 18:19:31	887	TRUE
2	2016-06-24 08:30:21	34	FALSE
2	2016-06-24 08:31:04	340000	TRUE
1	2016-06-24 09:06:44	6310	FALSE
⋮
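To make the frame logic concrete, here is a minimal Python sketch that emulates the query on the sample rows shown above. It is not how a database evaluates window functions; the function name `flag_alerts_rows` and the `frame_size` parameter are hypothetical names introduced only for this illustration.

```python
from collections import defaultdict

# Sample rows from the article: (user_id, date_time, data in kB).
ROWS = [
    (1, "2016-06-22 16:11:30", 22),
    (1, "2016-06-22 16:12:16", 25),
    (1, "2016-06-22 16:13:39", 2633),
    (1, "2016-06-22 21:16:29", 337),
    (1, "2016-06-22 22:41:59", 21),
    (1, "2016-06-23 08:38:17", 28),
    (1, "2016-06-23 09:01:47", 70900),
    (2, "2016-06-23 17:51:44", 9),
    (2, "2016-06-23 18:02:56", 10),
    (2, "2016-06-23 18:06:01", 9),
    (2, "2016-06-23 18:19:31", 887),
    (2, "2016-06-24 08:30:21", 34),
    (2, "2016-06-24 08:31:04", 340000),
    (1, "2016-06-24 09:06:44", 6310),
]


def flag_alerts_rows(rows, frame_size=5):
    """Emulate: data > COALESCE(SUM(data) OVER (PARTITION BY user_id
    ORDER BY date_time ROWS BETWEEN frame_size PRECEDING AND 1 PRECEDING), 0)."""
    if frame_size < 1:
        raise ValueError("frame_size must be at least 1")
    history = defaultdict(list)  # usages seen so far, per user (the partition)
    flagged = []
    # ISO-formatted timestamps sort correctly as plain strings.
    for user_id, date_time, data in sorted(rows, key=lambda r: r[1]):
        frame = history[user_id][-frame_size:]  # at most frame_size previous rows
        total = sum(frame)  # sum([]) == 0 plays the role of COALESCE(..., 0)
        flagged.append((user_id, date_time, data, data > total))
        history[user_id].append(data)
    return flagged


for row in flag_alerts_rows(ROWS):
    print(*row, sep="\t")
```

Calling `flag_alerts_rows(ROWS, frame_size=10)` corresponds to changing `5 PRECEDING` to `10 PRECEDING` in the query.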

The following code filters only alerts, which produces the final output:

SELECT
    user_id
    ,date_time
    ,data
FROM (SELECT
        user_id
        ,date_time
        ,data
        ,data > COALESCE(SUM(data) OVER (PARTITION BY user_id ORDER BY date_time
            ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING), 0) AS is_alert
    FROM calls
    WHERE data IS NOT NULL) AS alerts_flagmap
WHERE is_alert = TRUE
ORDER BY date_time;

Creating a boolean attribute with a window function is a simple way to “cherry-pick” rows with a specific property from the dataset. The table below contains only the rows that qualify as alerts under our rule. Note that it is easy to change the requirement to 10 or 100 preceding rows by altering just one number in the query.

user_id	date_time	data
1	2016-06-22 16:11:30	22
1	2016-06-22 16:12:16	25
1	2016-06-22 16:13:39	2633
1	2016-06-23 09:01:47	70900
2	2016-06-23 17:51:44	9
2	2016-06-23 18:02:56	10
2	2016-06-23 18:19:31	887
2	2016-06-24 08:31:04	340000
1	2016-06-25 17:50:51	14980
2	2016-06-28 07:57:09	1159600
⋮

Frame defined by RANGE

As long as we want to aggregate over a fixed number of individual rows, the ROWS clause is the right choice. Now imagine that you want to trigger an alert every time the current usage exceeds the total usage over the past 24 hours. The time frame of the previous 24 hours could include 50 rows, 1 row or none. A seemingly correct solution is to sum the usage per calendar day and use the LAG function to fetch the previous day's total. However, this does not produce the expected output: the time frame should be the last 24 hours, not the previous day. Let’s see how the RANGE clause is made for exactly this use case.
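The gap between "previous day" and "last 24 hours" can be sketched in a few lines of Python on user 1's rows from the sample shown earlier. This assumes the LAG approach works on per-calendar-day totals; the helper names `previous_day_total` and `trailing_24h_total` are hypothetical and exist only for this illustration.

```python
from datetime import datetime, timedelta

# User 1's usages from the sample shown earlier: (date_time, data in kB).
USAGES = [
    ("2016-06-22 16:11:30", 22),
    ("2016-06-22 16:12:16", 25),
    ("2016-06-22 16:13:39", 2633),
    ("2016-06-22 21:16:29", 337),
    ("2016-06-22 22:41:59", 21),
    ("2016-06-23 08:38:17", 28),
    ("2016-06-23 09:01:47", 70900),
    ("2016-06-24 09:06:44", 6310),
]
FMT = "%Y-%m-%d %H:%M:%S"
PARSED = [(datetime.strptime(ts, FMT), kb) for ts, kb in USAGES]


def previous_day_total(now):
    """Total of the previous calendar day (the 'daily sum plus LAG' idea)."""
    yesterday = (now - timedelta(days=1)).date()
    return sum(kb for ts, kb in PARSED if ts.date() == yesterday)


def trailing_24h_total(now):
    """Total of the 24 hours before `now`, excluding `now` itself."""
    start = now - timedelta(hours=24)
    return sum(kb for ts, kb in PARSED if start <= ts < now)


now = datetime.strptime("2016-06-24 09:06:44", FMT)
print(previous_day_total(now))  # 70928: both usages from 2016-06-23
print(trailing_24h_total(now))  # 0: nothing happened after 2016-06-23 09:06:44
```

For the usage at 09:06:44 on 2016-06-24, the calendar-day total would suppress the alert, while the trailing 24 hours contain no usage at all.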
The RANGE clause limits the frame to rows whose ordering value lies within the specified range relative to the current row's value. It operates logically on values, whereas the ROWS clause operates physically on rows of the table. Before we apply the RANGE clause to our use case, it is important to understand how the frame is defined, which we show on the following small sample. Let’s have a table temp with a single column num and the values 1, 2, 5, 5.

SELECT
    num
    ,COUNT(*) OVER (ORDER BY num ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rows_count
    ,COUNT(*) OVER (ORDER BY num RANGE BETWEEN 2 PRECEDING AND CURRENT ROW) AS range_count
    ,SUM(num) OVER (ORDER BY num ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rows_sum
    ,SUM(num) OVER (ORDER BY num RANGE BETWEEN 2 PRECEDING AND CURRENT ROW) AS range_sum
FROM temp
ORDER BY num;

We stored four values in a temporary table temp and calculated COUNT and SUM over a frame that starts 2 rows (for ROWS) or 2 value units (for RANGE) before the current row and ends at the current row. You can compare how the results differ for the ROWS and RANGE clauses:

num	rows_count	range_count	rows_sum	range_sum
1	1	1	1	1
2	2	2	3	3
5	3	2	8	10
5	3	2	12	10

The COUNT for ROWS is always 3 except for the first two rows, since the frame contains the row before the previous one (1.), the previous one (2.) and the current one (3.). The situation is more dynamic for the RANGE clause. Here, the query engine subtracts 2 from the current value and looks for the rows whose value lies between this number and the current value, inclusive. For example, at the third row, the range is [5 - 2, 5] = [3, 5], and only the last two rows (with value 5) have a num value in this interval, so the count is 2. If you understand this idea, then the SUM columns should come as no surprise.
CURRENT ROW together with the RANGE clause is often a source of misunderstanding because it behaves differently from ROWS when the sequence contains multiple equal values. Since the RANGE version substitutes the value 5 for CURRENT ROW in the example above, it reads the frame as “up to 5”. Therefore, all rows containing the value 5 are in the frame, regardless of whether they appear before or after the current row in the sequence.
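The two frame definitions can be emulated in Python to reproduce the comparison table for the values 1, 2, 5, 5. This is a sketch of the semantics only; `rows_frame` and `range_frame` are hypothetical helper names, not part of any database API.

```python
def rows_frame(values, i, preceding):
    """ROWS BETWEEN preceding PRECEDING AND CURRENT ROW: count physical rows."""
    return values[max(0, i - preceding): i + 1]


def range_frame(values, i, preceding):
    """RANGE BETWEEN preceding PRECEDING AND CURRENT ROW: compare values,
    so every peer equal to the current value is included."""
    current = values[i]
    return [v for v in values if current - preceding <= v <= current]


nums = sorted([1, 2, 5, 5])  # ORDER BY num
print("num", "rows_count", "range_count", "rows_sum", "range_sum", sep="\t")
for i, num in enumerate(nums):
    r = rows_frame(nums, i, 2)
    g = range_frame(nums, i, 2)
    print(num, len(r), len(g), sum(r), sum(g), sep="\t")
```

Note that `range_frame` never looks at the row position `i` beyond reading the current value, which is why both rows with the value 5 get identical results.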
Unfortunately, at the time of writing, Redshift does not implement the RANGE clause, and PostgreSQL before version 11 does not allow frame borders other than UNBOUNDED or CURRENT ROW with RANGE (PostgreSQL 11 and later accept value offsets as well). With these restrictions, the capabilities are very similar to the ROWS clause; the one difference is the behaviour for multiple equal values in the sequence, which RANGE and ROWS treat slightly differently, as we have seen earlier.
As a consequence, we will use the Google BigQuery engine to explore the capabilities of the RANGE clause. The following table presents RANGE support among the aforementioned three databases and Oracle, which provides full support. In the following post, we will introduce many more features of window functions and compare them across a range of databases.

Database	RANGE clause	Numeric values	Date values
Redshift	✘	✘	✘
PostgreSQL (before version 11)	✓	✘	✘
BigQuery	✓	✓	✘
Oracle	✓	✓	✓

Let’s return to our use case of internet usage. We stick to the idea of the past 24 hours: an alert is triggered every time the current usage exceeds the total usage over the past 24 hours. Now we know that the easiest way to achieve this is the RANGE clause. BigQuery supports numeric values inside the RANGE clause but no others, such as dates or timestamps. Since date_time is a timestamp in our use case, we cannot put it in the statement directly. As a workaround, we use the function UNIX_SECONDS, which converts a timestamp into an integer number of seconds since the Unix epoch. Next, we define the frame as 24 hours in seconds, which is 60 * 60 * 24 = 86400.
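The arithmetic behind the workaround can be checked with a short Python sketch. It assumes the timestamps are interpreted as UTC; `unix_seconds` is a hypothetical helper that mimics the idea of UNIX_SECONDS and is not BigQuery's implementation.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 60 * 60 * 24  # 86400


def unix_seconds(text):
    """Parse 'YYYY-MM-DD HH:MM:SS' as UTC and return whole seconds since 1970-01-01."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


current = unix_seconds("2016-06-24 09:06:44")
previous = unix_seconds("2016-06-23 09:01:47")

# Frame covered by RANGE BETWEEN 86400 PRECEDING AND 1 PRECEDING.
frame_start, frame_end = current - SECONDS_PER_DAY, current - 1

print(current - previous)                    # 86697 seconds apart
print(frame_start <= previous <= frame_end)  # False: outside the 24-hour frame
```

Once the ordering key is an integer, "24 hours before" is plain subtraction, which is exactly what a numeric RANGE offset needs.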

SELECT
    user_id
    ,date_time
    ,data
    -- PARTITION BY user_id keeps each user's 24-hour total separate
    ,data > COALESCE(SUM(data) OVER (PARTITION BY user_id ORDER BY UNIX_SECONDS(date_time)
        RANGE BETWEEN 86400 PRECEDING AND 1 PRECEDING), 0) AS is_alert
FROM calls
WHERE data IS NOT NULL
ORDER BY date_time;

Again, we want to leave out the current usage from the sum, therefore, we use 1 PRECEDING as the end of the frame. Let’s see the output:

user_id	date_time	data	is_alert
1	2016-06-22 16:11:30	22	TRUE
1	2016-06-22 16:12:16	25	TRUE
1	2016-06-22 16:13:39	2633	TRUE
1	2016-06-22 21:16:29	337	FALSE
1	2016-06-22 22:41:59	21	FALSE
1	2016-06-23 08:38:17	28	FALSE
1	2016-06-23 09:01:47	70900	TRUE
2	2016-06-23 17:51:44	9	TRUE
2	2016-06-23 18:02:56	10	TRUE
2	2016-06-23 18:06:01	9	FALSE
2	2016-06-23 18:19:31	887	TRUE
2	2016-06-24 08:30:21	34	FALSE
2	2016-06-24 08:31:04	340000	TRUE
1	2016-06-24 09:06:44	6310	TRUE
⋮

Note the last row, which is now TRUE: the last 24 hours do not even cover the previous usage of the user with id 1, which was at 09:01:47 on 2016-06-23. It is therefore this user's first data usage in more than a day, and any usage at that point counts as an alert. In the ROWS variant, by contrast, the sum was computed from the previous five rows, which reach more than a day into the past, so the alert was not triggered.
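The flags in the table above can be reproduced with a small Python emulation of a per-user 24-hour RANGE frame. This is a sketch under the assumption that the sum is computed per user and that timestamps are UTC; `flag_alerts_range` and `unix_seconds` are hypothetical names, and the quadratic scan is chosen for readability, not speed.

```python
from datetime import datetime, timezone

# Sample rows from the article: (user_id, date_time, data in kB).
ROWS = [
    (1, "2016-06-22 16:11:30", 22),
    (1, "2016-06-22 16:12:16", 25),
    (1, "2016-06-22 16:13:39", 2633),
    (1, "2016-06-22 21:16:29", 337),
    (1, "2016-06-22 22:41:59", 21),
    (1, "2016-06-23 08:38:17", 28),
    (1, "2016-06-23 09:01:47", 70900),
    (2, "2016-06-23 17:51:44", 9),
    (2, "2016-06-23 18:02:56", 10),
    (2, "2016-06-23 18:06:01", 9),
    (2, "2016-06-23 18:19:31", 887),
    (2, "2016-06-24 08:30:21", 34),
    (2, "2016-06-24 08:31:04", 340000),
    (1, "2016-06-24 09:06:44", 6310),
]


def unix_seconds(text):
    """Parse 'YYYY-MM-DD HH:MM:SS' as UTC and return whole seconds since 1970-01-01."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def flag_alerts_range(rows, seconds=86400):
    """Emulate, per user: RANGE BETWEEN seconds PRECEDING AND 1 PRECEDING."""
    keyed = sorted((unix_seconds(ts), user, ts, data) for user, ts, data in rows)
    flagged = []
    for now, user, ts, data in keyed:
        # Same user only (PARTITION BY), value-based bounds on the sort key (RANGE).
        total = sum(
            d for t, u, _, d in keyed
            if u == user and now - seconds <= t <= now - 1
        )
        flagged.append((user, ts, data, data > total))
    return flagged


for row in flag_alerts_range(ROWS):
    print(*row, sep="\t")
```

The only row whose flag differs from the ROWS variant is the last one, because its frame is empty and the total falls back to zero.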
The following query wraps the previous output and filters only the rows with the positive alert flag so that we can see only the alerts.

SELECT
    user_id
    ,date_time
    ,data
FROM (SELECT
        user_id
        ,date_time
        ,data
        -- COALESCE turns the NULL sum of an empty frame into 0
        ,data > COALESCE(SUM(data) OVER (PARTITION BY user_id ORDER BY UNIX_SECONDS(date_time)
            RANGE BETWEEN 86400 PRECEDING AND 1 PRECEDING), 0) AS is_alert
    FROM calls
    WHERE data IS NOT NULL) AS alerts_flagmap
WHERE is_alert = TRUE
ORDER BY date_time;

Output:

user_id	date_time	data
1	2016-06-22 16:11:30	22
1	2016-06-22 16:12:16	25
1	2016-06-22 16:13:39	2633
1	2016-06-23 09:01:47	70900
2	2016-06-23 17:51:44	9
2	2016-06-23 18:02:56	10
2	2016-06-23 18:19:31	887
2	2016-06-24 08:31:04	340000
1	2016-06-24 09:06:44	6310
1	2016-06-24 15:50:44	39088
⋮
