Redshift Window Functions Advanced Use Cases
Merging time-based events into periods
Thanks to our previous posts about the window functions, Introduction to Window functions on Redshift and Data exploration with Window functions on Redshift, you should now be familiar with the most common functions that can be used in the OVER clause and how to apply them to your data. Today, we introduce more advanced use cases, which complexity would usually require computation in the application instead of SQL engine. We will show how to achieve the result in SQL using a combination of multiple window functions and other features of SQL.
FlowForward.
All Things Data Engineering
Straight to Your Inbox!
We will work with the same dataset as in the first article. It is available here and contains information about mobile credit top-ups of five customers:
top_up_seq |
user_id |
date |
topup_value |
---|---|---|---|
68 |
3 |
2017-06-28 |
15 |
67 |
5 |
2017-06-26 |
5 |
66 |
2 |
2017-06-20 |
5 |
65 |
1 |
2017-06-20 |
15 |
64 |
4 |
2017-06-17 |
25 |
63 |
5 |
2017-06-15 |
20 |
62 |
5 |
2017-06-12 |
5 |
61 |
4 |
2017-06-07 |
10 |
60 |
4 |
2017-06-05 |
5 |
59 |
2 |
2017-06-05 |
10 |
58 |
5 |
2017-06-02 |
10 |
57 |
5 |
2017-05-31 |
25 |
56 |
4 |
2017-05-30 |
25 |
55 |
3 |
2017-05-25 |
5 |
54 |
1 |
2017-05-22 |
10 |
⋮ |
In the scenario we go through today, the mobile operator runs promotions for pay as you go customers: if you top up for at least €20 you get free calls for the next 28 calendar days. Customers can apply the promotion multiple times within the 28 calendar days and extend the free call period. Our final goal is to compute start and end dates of free call intervals for each customer. Let’s see an example. The customer in our sample data has made three top-ups:
user_id |
date |
topup_value |
---|---|---|
1 |
2017-02-20 |
15 |
1 |
2017-01-28 |
25 |
1 |
2017-01-03 |
25 |
On 2017-01-03 the customer tops-up for €25 (above our threshold of €20). His free call interval begins on this date. He tops-up again for €25 on 2017-01-28, which is 25 days after the first top-up and therefore, still within the free call period. Hence, the free call interval extends for another 28 days and the interval end date is 2017-01-28 plus 28, which is 2017-02-25. Notice that the top-up on 2017-02-20 is below the free call threshold of €20. As a result, the free call period is not extended and ends on 2017-02-25 (28 days after the last top up above €20).
The free call interval starts on 2017-01-03 and ends on 2017-02-25.
user_id |
interval_start | |
---|---|---|
1 |
2017-01-03 |
2017-02-25 |
To better understand the process of merging periods into intervals, see the picture below. The first three top-ups are chronologically close enough to be merged into an interval, whereas the fourth top-up is not and starts a new interval. Let’s assume all the top-ups are of value €20 or more.
What makes this use case a challenge is the fact that for any number of intersecting periods (applying the top-up multiple times), we want to produce just one interval.
We will present the query in four steps, where the output of each step will be input for the next step. Outputs will be saved as SQL tables for clarity.
- Initially, we include only top-ups of €20 or greater. For each top-up, we compute the number of days between the current top up and the previous/next top up.
- Next, we will use these values to flag those rows with the start and end dates for the free call interval.
- In the third step, we filter out those rows with top-ups that fall inside the free call period, which gives us two rows. One for the start and the other for the end of the free call interval. case top-ups. We also compute ending dates of intervals as its last top-up date plus 28 days.
- We reduce the output from the previous step to just one row to get the start and end dates for each free call interval per customer.
Step 1: Calculating Top-Up Intervals with LAG/LEAD
First, let’s filter out top-ups with less than €20 and enrich the dataset to include columns with the days since the previous top-up and days to the next top-up.
1 2 3 4 5 6 7 8 9 10 11 |
CREATE TABLE topups_enriched AS SELECT user_id ,date ,topup_value ,date - LAG(date,1) OVER (PARTITION BY user_id ORDER BY date) AS days_after_prev_20topup ,LEAD(date,1) OVER (PARTITION BY user_id ORDER BY date) - date AS days_before_next_20topup FROM topups WHERE topup_value >= 20 |
We can simply retrieve the date differences using the LAG function, which looks at the previous row and LEAD function, which looks at the following row. Since we are ordering by date, previous and following rows correspond to the previous and following top-ups.
Result:
user_id |
date |
topup_ |
days_after_prev |
days_before_next_20topup |
---|---|---|---|---|
⋮ | ||||
4 |
2017-06-17 |
25 |
18 | |
4 |
2017-05-30 |
25 |
76 |
18 |
4 |
2017-03-15 |
25 |
28 |
76 |
4 |
2017-02-15 |
20 |
27 |
28 |
4 |
2017-01-19 |
25 |
19 |
27 |
4 |
2016-12-31 |
20 |
11 |
19 |
4 |
2016-12-20 |
20 |
11 | |
5 |
2017-06-15 |
20 |
15 | |
5 |
2017-05-31 |
25 |
32 |
15 |
5 |
2017-04-29 |
20 |
93 |
32 |
5 |
2017-01-26 |
20 |
26 |
93 |
5 |
2016-12-31 |
25 |
45 |
26 |
5 |
2016-11-16 |
20 |
0 |
45 |
5 |
2016-11-16 |
25 |
0 |
Let’s have a look at two examples. The first is a free call interval of the customer with id 4, coloured in blue. This interval starts with his first top-up date (2016-12-20) and ends with the date 2017-03-15 plus 28 days, which is 2017-04-12. We know that if the previous top-up was more than 28 days ago, the current date is the starting date of a new interval. Similarly, if the next top-up is more than 28 days ahead, then we know that the current top-up is the last of the free call interval.
The second occurrence is for the customer with id 5 and is coloured in green. He topped-up €20 or more twice on the same day, but the next top-up is 45 days away, so we expect one free call interval from 2016-11-16 to 2016-12-14 (which is 2016-11-16 plus 28 days).
Step 2: Mark Starts and End Dates of Free Call Periods
Now let’s enrich the dataset even more. We will create 0/1 flags determining whether the current top-up marks the start or the end of an interval. Moreover, we will introduce a flag for the edge top-up, which is simply either start or end of any interval. We will call inner top-ups those that are not edge top-ups and delete them in the next step.
1 2 3 4 5 6 7 8 9 10 11 12 13 |
CREATE TABLE extra_flags AS SELECT topups_enriched.* ,CASE WHEN COALESCE(days_after_prev_20topup,30) > 28 OR COALESCE(days_before_next_20topup,30) > 28 THEN 1 ELSE 0 END AS edge_topup_flg ,CASE WHEN COALESCE(days_after_prev_20topup,30) > 28 THEN 1 ELSE 0 END AS starting_topup_flg ,CASE WHEN COALESCE(days_before_next_20topup,30) > 28 THEN 1 ELSE 0 END AS ending_topup_flg FROM topups_enriched; |
The CASE statement is a natural way how to create flags. Note the COALESCE function that returns the first argument and only if the first argument is null, then returns the second one. We are using COALESCE function to return flag “1” in the case of null values. The reason is that the null value in the days_after_prev_20topup column means that it is the very first top-up, which is for sure the start of a period. Analogously for the days_before_next_20topup column.
Result:
user_id |
date |
topup_ |
days_after_prev |
days_before_next |
edge_topup |
starting_topup |
ending_topup |
---|---|---|---|---|---|---|---|
⋮ | |||||||
4 |
2017-06-17 |
25 |
18 |
1 |
0 |
1 | |
4 |
2017-05-30 |
25 |
76 |
18 |
1 |
1 |
0 |
4 |
2017-03-15 |
25 |
28 |
76 |
1 |
0 |
1 |
4 |
2017-02-15 |
20 |
27 |
28 |
0 |
0 |
0 |
4 |
2017-01-19 |
25 |
19 |
27 |
0 |
0 |
0 |
4 |
2016-12-31 |
20 |
11 |
19 |
0 |
0 |
0 |
4 |
2016-12-20 |
20 |
11 |
1 |
1 |
0 | |
5 |
2017-06-15 |
20 |
15 |
1 |
0 |
1 | |
5 |
2017-05-31 |
25 |
32 |
15 |
1 |
1 |
0 |
5 |
2017-04-29 |
20 |
93 |
32 |
1 |
1 |
1 |
5 |
2017-01-26 |
20 |
26 |
93 |
1 |
0 |
1 |
5 |
2016-12-31 |
25 |
45 |
26 |
1 |
1 |
0 |
5 |
2016-11-16 |
20 |
0 |
45 |
1 |
0 |
1 |
5 |
2016-11-16 |
25 |
0 |
1 |
1 |
0 |
For the customer with id 4, the starting_topup_flag flag is true for the first date, and the ending flag is true for the top-up on 2017-03-15. Notice that the top-ups between them have both starting and ending flag false. Since we seek edge dates of intervals, the inner top-ups carry no useful information for us and they need to be filtered out.
Step 3: Filtering inner top-ups and computing ends
In the next step, we filter out the inner top-ups, as we discussed earlier, and compute the end dates of the intervals.
1 2 3 4 5 6 7 8 9 10 11 12 |
CREATE TABLE full_ranges AS SELECT user_id ,topup_value ,date AS interval_start ,CASE ending_topup_flg WHEN 1 THEN date + 28 ELSE 28 + LEAD(date) OVER (PARTITION BY user_id ORDER BY date) END AS interval_end ,starting_topup_flg ,ending_topup_flg FROM extra_flags WHERE edge_topup_flg = 1; |
The end dates are computed as the next date in the sequence plus 28 days. Here, it is crucial to understand that the sequence already lacks the inner top-ups (see WHERE clause), and therefore, the output contains only first and last top-up for each free call interval. Now the interesting computation happens at rows with the first top-up of the interval: the LEAD function adds 28 days to the date from the following row, which is the last top-up of the interval. Indeed, this calculates correctly the end date of the interval.
An exception applies for the intervals consisting of only one top-up, indicated by both positive starting and positive ending flag, which ends exactly 28 days after their start.
Result:
user_id |
topup_value |
interval_start |
interval_end |
starting_topup_flg |
ending_topup_flg |
---|---|---|---|---|---|
⋮ | |||||
4 |
25 |
2017-06-17 |
2017-07-15 |
0 |
1 |
4 |
25 |
2017-05-30 |
2017-07-15 |
1 |
0 |
4 |
25 |
2017-03-15 |
2017-04-12 |
0 |
1 |
4 |
20 |
2016-12-20 |
2017-04-12 |
1 |
0 |
5 |
20 |
2017-06-15 |
2017-07-13 |
0 |
1 |
5 |
25 |
2017-05-31 |
2017-07-13 |
1 |
0 |
5 |
20 |
2017-04-29 |
2017-05-27 |
1 |
1 |
5 |
20 |
2017-01-26 |
2017-02-23 |
0 |
1 |
5 |
25 |
2016-12-31 |
2017-02-23 |
1 |
0 |
5 |
20 |
2016-11-16 |
2016-12-14 |
0 |
1 |
5 |
25 |
2016-11-16 |
2016-12-14 |
1 |
0 |
Now we can see that both the green and blue areas have only two rows, as only edge top-ups remained. Moreover, the first blue row with the positive starting flag already contains the end of the interval (2017-04-12), which agrees with our expectation.
Step 4: Final start/end date extraction
The query which computes the final output is simple: we pick the rows with the positive starting_topup_flg flag. The ends of the periods are already present in the interval_end field, and therefore, we can ignore rows with the negative starting flag.
1 2 3 4 5 6 7 |
SELECT user_id ,interval_start ,interval_end FROM full_ranges WHERE starting_topup_flg = 1 ORDER BY user_id, interval_start DESC; |
Result:
user_id |
interval_start |
interval_end |
---|---|---|
1 |
2017-03-20 |
2017-05-15 |
1 |
2017-01-03 |
2017-02-25 |
1 |
2016-11-07 |
2017-01-02 |
2 |
2017-01-15 |
2017-02-12 |
3 |
2017-05-11 |
2017-06-08 |
3 |
2017-03-15 |
2017-05-03 |
3 |
2017-02-01 |
2017-03-01 |
3 |
2016-12-19 |
2017-01-16 |
4 |
2017-05-30 |
2017-07-15 |
4 |
2016-12-20 |
2017-04-12 |
5 |
2017-05-31 |
2017-07-13 |
5 |
2017-04-29 |
2017-05-27 |
5 |
2016-12-31 |
2017-02-23 |
5 |
2016-11-16 |
2016-12-14 |
The blue and green areas were reduced into one row, representing the free call interval. Notice that the dates match with our expectation.
All in one single query
To achieve the final result, we have created three SQL tables so that the process was more understandable. Naturally, we can reach the same result with a single, consolidated query:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 |
WITH topups_enrich AS ( SELECT user_id ,topup_value ,date -- check if the next top-up is more than 28 days away -> start of the period ,COALESCE(date - LAG(date,1) OVER (PARTITION BY user_id ORDER BY date),30) > 28 AS starting_topup_flg -- check if the previous top-up is more than 28 days away -> end of the period ,COALESCE(LEAD(date,1) OVER (PARTITION BY user_id ORDER BY date) - date,30)>28 AS ending_topup_flg FROM topups WHERE topup_value>=20 ), full_ranges AS( SELECT te.* -- add information about end of the period as (date in the next row + 28) ,CASE WHEN ending_topup_flg THEN date ELSE LEAD(date) OVER (PARTITION BY user_id ORDER BY date) END + 28 AS interval_end FROM topups_enrich AS te -- filter out non-edge top-ups (neither starting nor ending) WHERE starting_topup_flg OR ending_topup_flg ) -- range information is complete -> keep only starting rows SELECT user_id, date AS interval_start, interval_end FROM full_ranges WHERE starting_topup_flg ORDER BY user_id, interval_start DESC; |
Notice that the table extra_flags was only for illustrative purposes and we can get by with the conditions that appeared in the flag definitions. Is it possible to reduce the query even more? Unfortunately not. So what is the difference that makes the extra_flags table needless? Here, we can present a rule of thumb which says that the number of cascading queries is dependent only on the cases where an output of a window function is to be used by another window function. Every cascading dependency requires a new subquery. The SQL language does not allow us to nest the window functions or use an output of one in another in the same SELECT query.
In the query above, the first subquery produces the flags using window functions, which are then used by the second subquery to filter the rows and apply next window function on the result. Clearly, we are not able to combine the logic into a single query. The reason for the final SELECT query is that the WHERE clause logically operates before the window functions and we need a final filtering. Note that no window function was used to create the extra_flags table, and therefore, we can carry its logic to other subqueries.