Window Function ROWS and RANGE on Redshift and BigQuery
Frames in window functions allow us to operate on subsets of the partitions by breaking the partition into even smaller sequences of rows. SQL provides syntax to express very flexible definitions of a frame. We described the syntax in the first post on Window functions and demonstrated some basic use cases in the post on Data Exploration with Window Functions and Advanced Use Cases of Window Functions. So far we always defined the frame by the ROWS clause and the frame borders followed first, last or current row. In this post, we will introduce fixed-size frame and RANGE clause as an alternative to ROWS clause. Since Redshift does not support the RANGE clause yet, we will demonstrate this feature on Google BigQuery database.
FlowForward.
All Things Data Engineering
Straight to Your Inbox!
Frame defined by ROWS
Every time we work with temporal data and we need to compute some value based on other values that are within a precise time unit from the current one, we choose a fixed-size moving frame. For example, in case of a stock market or weather, we only care about the few previous days when comparing to the current exchange rate or temperature. We will demonstrate the fixed-size frame on alerts of mobile internet usage.
[rs-banner]
We reuse our working dataset from the post on Data Exploration with Window Functions, which contains phone calls and internet data usage measured in kB of two users. You can download the dataset here. We will consider only internet usage and filter out the phone calls.
1 2 3 4 5 6 7 |
SELECT user_id ,date_time ,data FROM calls WHERE data IS NOT null ORDER BY date_time; |
Let’s see a sample:
user_id |
date_time |
data |
---|---|---|
1 |
2016-06-22 16:11:30 |
22 |
1 |
2016-06-22 16:12:16 |
25 |
1 |
2016-06-22 16:13:39 |
2633 |
1 |
2016-06-22 21:16:29 |
337 |
1 |
2016-06-22 22:41:59 |
21 |
1 |
2016-06-23 08:38:17 |
28 |
1 |
2016-06-23 09:01:47 |
70900 |
2 |
2016-06-23 17:51:44 |
9 |
2 |
2016-06-23 18:02:56 |
10 |
2 |
2016-06-23 18:06:01 |
9 |
2 |
2016-06-23 18:19:31 |
887 |
2 |
2016-06-24 08:30:21 |
34 |
2 |
2016-06-24 08:31:04 |
340000 |
1 |
2016-06-24 09:06:44 |
6310 |
We want to be notified about unusually large data usages. Let’s say, every time the usage is larger than a total of the last five day’s usage. Thus, the scope of interest is the previous five usages in the sequence ordered by date and time. One usage corresponds to one row in our data and so we will define a frame of fixed size 5 by means of the ROWS clause.
1 2 3 4 5 6 7 8 9 |
SELECT user_id ,date_time ,data ,data > COALESCE(SUM(data) OVER (PARTITION BY user_id ORDER BY date_time ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING), 0) AS is_alert FROM calls WHERE data IS NOT null ORDER BY date_time; |
We kept the original attributes and added one of a boolean type, which determines if the alert applies. The window function SUM takes care of calculating the total and the ROWS clause takes care of the frame borders: the frame starts at the fifth row preceding the current one and ends at the previous row (we do not want to include the current row). Furthermore, we have to check for the null values that initiate from an empty frame (first row of each customer). We want the total to be zero if the frame is empty, which is exactly what the COALESCE function does. The output follows:
user_id |
date_time |
data |
is_alert |
---|---|---|---|
1 |
2016-06-22 16:11:30 |
22 |
TRUE |
1 |
2016-06-22 16:12:16 |
25 |
TRUE |
1 |
2016-06-22 16:13:39 |
2633 |
TRUE |
1 |
2016-06-22 21:16:29 |
337 |
FALSE |
1 |
2016-06-22 22:41:59 |
21 |
FALSE |
1 |
2016-06-23 8:38:17 |
28 |
FALSE |
1 |
2016-06-23 9:01:47 |
70900 |
TRUE |
2 |
2016-06-23 17:51:44 |
9 |
TRUE |
2 |
2016-06-23 18:02:56 |
10 |
TRUE |
2 |
2016-06-23 18:06:01 |
9 |
FALSE |
2 |
2016-06-23 18:19:31 |
887 |
TRUE |
2 |
2016-06-24 8:30:21 |
34 |
FALSE |
2 |
2016-06-24 8:31:04 |
340000 |
TRUE |
1 |
2016-06-24 9:06:44 |
6310 |
FALSE |
⋮ |
The following code filters only alerts, which produces the final output:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
SELECT user_id ,date_time ,data FROM (SELECT user_id ,date_time ,data ,data > COALESCE(SUM(data) OVER (PARTITION BY user_id ORDER BY date_time ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING), 0) AS is_alert FROM calls WHERE data IS NOT null) AS alerts_flagmap WHERE is_alert = TRUE ORDER BY date_time; |
Creating a boolean attribute by window function is a simple way how to “cherry-pick” rows with some specific property from the dataset. The table below contains only qualifying alerts according to our rule. Note that it is easy to change the requirement to 10 or 100 preceding rows by altering just one number in the query.
user_id |
date_time |
data |
---|---|---|
1 |
2016-06-22 16:11:30 |
22 |
1 |
2016-06-22 16:12:16 |
25 |
1 |
2016-06-22 16:13:39 |
2633 |
1 |
2016-06-23 9:01:47 |
70900 |
2 |
2016-06-23 17:51:44 |
9 |
2 |
2016-06-23 18:02:56 |
10 |
2 |
2016-06-23 18:19:31 |
887 |
2 |
2016-06-24 8:31:04 |
340000 |
1 |
2016-06-25 17:50:51 |
14980 |
2 |
2016-06-28 7:57:09 |
1159600 |
⋮ |
Frame defined by RANGE
As long as we want to aggregate over individual entries from the table, the ROWS clause is the right choice. Now imagine that you want to trigger an alert every time the current usage exceeds the total usage over the past 24 hours. The time frame of the previous 24 hours could include 50 rows, 1 row or none. A seemingly correct solution is to aggregate and sum the usage by 24 hours and use the LAG function, however, it does not produce the expected output. The time frame should be the last 24 hours, not the previous day. Let’s show how the RANGE clause is made just for this use case.
[rs-button]
The RANGE clause limits the frame to contain rows that have its value within the specified range, relative to the current value. It operates logically on values, whereas the ROWS clause operates physically on rows of the table. Before we apply the RANGE clause to our use case, it is important to understand how the frame is defined on the following small sample. Let’s have a table temp with a single column num and values 1,2,5,5.
1 2 3 4 5 6 7 8 |
SELECT num ,COUNT(*) OVER (ORDER BY num ROWS BETWEEN 2 PRECEDING AND CURRENT ROW ) AS rows_count ,COUNT(*) OVER (ORDER BY num RANGE BETWEEN 2 PRECEDING AND CURRENT ROW) AS range_count ,SUM(num) OVER (ORDER BY num ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rows_sum ,SUM(num) OVER (ORDER BY num RANGE BETWEEN 2 PRECEDING AND CURRENT ROW) AS range_sum FROM temp ORDER BY num; |
We created four values in a temporary table temp and calculated COUNT and SUM over a fixed-size frame bordered by the second before current and the current row/value. You can compare how the results differ for ROWS and RANGE clauses:
num |
rows_count |
range_count |
rows_sum |
range_sum |
---|---|---|---|---|
1 |
1 |
1 |
1 |
1 |
2 |
2 |
2 |
3 |
3 |
5 |
3 |
2 |
8 |
10 |
5 |
3 |
2 |
12 |
10 |
The COUNT for the ROWS must be always 3 except for the first two rows since the frame contains the row before previous (1.), the previous (2.) and the current (3.). The situation is more dynamic for the RANGE clause. Here, the query engine subtracts 2 from the current value and looks for the rows in the range from this number to the current value. For example, at the third row, the range is (5 – 2, 5) = (3,5) and only the last two rows (with value 5) have the num value in this interval, therefore the count is 2. If you understand this idea then the SUM columns should be no surprise.
The CURRENT ROW together with the RANGE clause is often a source of misunderstanding because it behaves differently from ROWS on multiple equal values in the sequence. Since the RANGE version substitutes CURRENT ROW for the value 5 in the example above, it understands the frame “up to 5”, and therefore, all rows containing the value 5 are present in the frame, regardless of how many rows before or after the current one appear in the sequence.
[cloud_book_banner]
Unfortunately, Redshift developers have not yet implemented the RANGE clause and PostgreSQL does not allow to customize the frame by any values other than UNBOUNDED or CURRENT ROW. The capabilities are then very similar to the ROWS clause, however, the one difference is the behaviour for multiple equal values in the sequence, which are treated in a little different way for RANGE and ROWS, as we have seen earlier.
As a consequence, we will use Google BigQuery engine to explore capabilities of the RANGE clause. The following table presents RANGE support among the aforementioned three databases and Oracle, which provides the full support. In the following post, we will introduce much more features of the window functions and compare them among all kinds of databases.
RANGE clause |
numeric values |
date values | |
---|---|---|---|
Redshift |
✘ |
✘ |
✘ |
PostgreSQL |
✓ |
✘ |
✘ |
BigQuery |
✓ |
✓ |
✘ |
Oracle |
✓ |
✓ |
✓ |
Let’s return to our use case of the internet usage. We will stick to the idea of the past 24 hours: alert is triggered every time the current usage exceeds the total usage over the past 24 hours. Now we know that the easiest way how to achieve it is to use the RANGE clause. The BigQuery supports any numeric values inside RANGE clause, however, we cannot use any others, such as date or timestamp. As we are using date type in our use case, we cannot put it in the statement directly. As a workaround, we will use a function UNIX_SECONDS, which converts the timestamp into the integer of seconds in the Unix time. Next, we define the frame as 24 hours in seconds, which is 60 * 60 * 24 = 86400.
1 2 3 4 5 6 7 8 9 |
SELECT user_id ,date_time ,data ,data > COALESCE(SUM(data) OVER (ORDER BY UNIX_SECONDS(date_time) RANGE BETWEEN 86400 PRECEDING AND 1 PRECEDING), 0) AS is_alert FROM calls WHERE data IS NOT null ORDER BY date_time; |
Again, we want to leave out the current usage from the sum, therefore, we use 1 PRECEDING as the end of the frame. Let’s see the output:
user_id |
date_time |
data |
is_alert |
---|---|---|---|
1 |
2016-06-22 16:11:30 |
22 |
TRUE |
1 |
2016-06-22 16:12:16 |
25 |
TRUE |
1 |
2016-06-22 16:13:39 |
2633 |
TRUE |
1 |
2016-06-22 21:16:29 |
337 |
FALSE |
1 |
2016-06-22 22:41:59 |
21 |
FALSE |
1 |
2016-06-23 08:38:17 |
28 |
FALSE |
1 |
2016-06-23 09:01:47 |
70900 |
TRUE |
2 |
2016-06-23 17:51:44 |
9 |
TRUE |
2 |
2016-06-23 18:02:56 |
10 |
TRUE |
2 |
2016-06-23 18:06:01 |
9 |
FALSE |
2 |
2016-06-23 18:19:31 |
887 |
TRUE |
2 |
2016-06-24 08:30:21 |
34 |
FALSE |
2 |
2016-06-24 08:31:04 |
340000 |
TRUE |
1 |
2016-06-24 09:06:44 |
6310 |
TRUE |
⋮ |
Note the last row, which is now true, as the last 24 hours does not even cover the previous usage of the customer with id 1, which was at 9:01:47 on 6/23. Therefore it is his large data usage after a long time, thus considered as an alert. Whereas in the ROWS variant, the sum was computed from the previous five rows, which reach more than a day into the past and so alert was not triggered.
The following query wraps the previous output and filters only the rows with the positive alert flag so that we can see only the alerts.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
SELECT user_id ,date_time ,data FROM (SELECT user_id ,date_time ,data ,data > SUM(data) OVER (ORDER BY UNIX_SECONDS(date_time) RANGE BETWEEN 86400 PRECEDING AND 1 PRECEDING) AS is_alert FROM calls WHERE data IS NOT null) AS alerts_flagmap WHERE is_alert = TRUE ORDER BY date_time; |
Output:
user_id |
date_time |
data |
---|---|---|
1 |
2016-06-22 16:11:30 |
22 |
1 |
2016-06-22 16:12:16 |
25 |
1 |
2016-06-22 16:13:39 |
2633 |
1 |
2016-06-23 09:01:47 |
70900 |
2 |
2016-06-23 17:51:44 |
9 |
2 |
2016-06-23 18:02:56 |
10 |
2 |
2016-06-23 18:19:31 |
887 |
2 |
2016-06-24 08:31:04 |
340000 |
1 |
2016-06-24 09:06:44 |
6310 |
1 |
2016-06-24 15:50:44 |
39088 |
⋮ |