Redshift's Window Functions Advanced use case – Sessionization
In the last post about the Window Functions, we introduced an advanced use case, in which the window functions help to make the query more readable, simple and efficient. The problem was to find free call intervals for each customer, which are created as customers tops-up their credit for at least €20 and get free calls for the next 28 days. The solution lay in computing the number of days since the previous/next top up including only top-ups of €20 or greater, indentifying first and last top ups of intervals and filtering out the inner top ups.
The approach presented in the previous post has some advantages, for example, the logic can be easily inverted to extract the information about the periods without the free credit. It has, however, a hard to miss downside – it does not allow us to aggregate all the top-ups, ie. what if we want to include all those €5 or €10 top-ups that might have also happened during promotional period into aggregates? This requires an alternative approach in which all the events are tagged respectively instead of filtered out. We will introduce such an approach in this post.
Finding the free call intervals can be compared to the process of a sessionization. You can understand the sessionization as finding sessions in time series by grouping the data into sessions. A simple approach is to apply fixed time interval on the periods of inactivity and if the inactivity lasts longer than the interval, cut the session and start a new one. It could be 30 minutes in the context of clickstreams of some web application, which might then use the individual sessions to make some statistics about the users’ behaviour. In the context of our use case, the interval is 28 days and we understand the free call interval as a session. The task is little more complicated since only some of the top ups determine the sessions.
Again, let’s break the query into simple, functional steps. The plan is as follows:
- Firstly, we calculate the number of days since the previous €20+ top-up for each top-up.
- Secondly, we use this information to mark the first top-up of each interval. Moreover, we determine the top-ups that belong to some interval to filter out the others.
- Thirdly, we assign a tag to each top-up, so that all top-ups in the same interval have the same customer id and tag. Besides, we calculate the end dates of the intervals.
- Finally, we group over the customer id and tag to produce arbitrary aggregation result over all top-ups in the interval. Lastly, we extract the starting and ending dates of intervals.
To demonstrate the use case of the top-ups aggregation, we print also a sum of all credits that a customer topped-up in each interval.
Step 1: Days since the previous qualifying top-up
First, we compute the number of days from the previous €20+ top-up. It’s similar to what we did before but this time we will include but ignore the top-ups with less than €20 value. Therefore, we can interpret the value as the number of days since the nearest previous top-up with more than €20.
1 2 3 4 5 6 7 8 9 |
CREATE TABLE topups_enriched AS SELECT topups.* ,date - MAX(CASE WHEN topup_value >= 20 THEN date END) OVER ( PARTITION BY user_id ORDER BY date, top_up_seq ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING ) AS days_since_prev_qualifying_topup FROM topups ORDER BY user_id, date DESC, top_up_seq DESC; |
- The OVER clause creates a frame from the first row (still) till the previous one (moving), so it’s ordered by date. That means from the oldest top-up till the previous. The top_up_seq field ensures that the output will be deterministic even for multiple top-ups per day.
- The CASE clause returns the date if the current top-up has a value of 20 or more. Otherwise, it returns null.
- We search the MAX value in the frame because out of all dates marking the high top-ups, we want to pick relatively the most recent one. Of course, the function skips null values, hence we truly get the date of the last high top-up.
Output (we leave out topup_seq):
user_id | date | topup_value | days_since_prev_qualifying_topup |
---|---|---|---|
⋮ | |||
4 | 2017-06-17 | 25 | 18 |
4 | 2017-06-07 | 10 | 8 |
4 | 2017-06-05 | 5 | 6 |
4 | 2017-05-30 | 25 | 76 |
4 | 2017-04-22 | 10 | 38 |
4 | 2017-04-10 | 5 | 26 |
4 | 2017-03-15 | 25 | 28 |
4 | 2017-03-09 | 5 | 22 |
4 | 2017-02-25 | 15 | 10 |
4 | 2017-02-15 | 20 | 27 |
4 | 2017-01-25 | 5 | 6 |
4 | 2017-01-19 | 25 | 19 |
4 | 2017-01-02 | 10 | 2 |
4 | 2016-12-31 | 20 | 11 |
4 | 2016-12-20 | 20 | |
⋮ | |||
5 | 2016-12-31 | 25 | 45 |
5 | 2016-12-11 | 5 | 25 |
5 | 2016-12-07 | 10 | 21 |
5 | 2016-11-17 | 15 | 1 |
5 | 2016-11-16 | 25 | 0 |
5 | 2016-11-16 | 20 |
We can check that the value for the customer with id 5 on 2016-12-07 is 21 (days from the high top-up 2016-11-16), and it is 25 on 2016-12-11 (again from 2016-11-16). Let’s track again the same free call intervals, coloured with the green and blue colour. As opposed to the previous approach, the intervals now also contain the top-ups with a value less than €20. As an example, see the top-up on 2017-04-10 for the customer with id 4: it also falls into the free call interval because it is within the 28 days since the last €20+ top-up of the interval (which is 2017-03-15). Therefore, these top-ups belong to the interval too and we will use them to compute the sum of all top-ups in an interval.
Step 2: Mark top-ups that belong to a free call interval
Based on the value from the previous step, we can determine which top-ups belong to some free call interval and which does not occur in any. The condition follows: if the top-up has value more than or equal to €20 or is no more than 28 days later than the previous qualifying top-up, then it belongs to some free call interval.
Besides, we will also mark the interval starting top-ups, which is going to be useful for the period sequence calculation. We know which top-up starts the interval because the value must be more than or equal to €20 and the previous qualifying top-up must be more than 28 days ago.
This gives two extra flags we will include in calculations:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
CREATE TABLE topups_flags AS SELECT topups_enriched.* -- belonging to some free call interval flag ,CASE WHEN NOT(COALESCE(days_since_prev_qualifying_topup, 30) > 28) OR topup_value >= 20 THEN 1 END AS belongs_to_interval -- first top-up of an interval flag ,CASE WHEN COALESCE(days_since_prev_qualifying_topup, 30) > 28 AND topup_value >= 20 THEN 1 END AS interval_start_flag FROM topups_enriched ORDER BY user_id, date DESC, top_up_seq DESC; |
user_id | date | topup_value | days_since_prev_qualifying_topup | belongs_to_interval | interval_start_flag |
---|---|---|---|---|---|
4 | 2017-06-17 | 25 | 18 | 1 | |
4 | 2017-06-07 | 10 | 8 | 1 | |
4 | 2017-06-05 | 5 | 6 | 1 | |
4 | 2017-05-30 | 25 | 76 | 1 | 1 |
4 | 2017-04-22 | 10 | 38 | ||
4 | 2017-04-10 | 5 | 26 | 1 | |
4 | 2017-03-15 | 25 | 28 | 1 | |
4 | 2017-03-09 | 5 | 22 | 1 | |
4 | 2017-02-25 | 15 | 10 | 1 | |
4 | 2017-02-15 | 20 | 27 | 1 | |
4 | 2017-01-25 | 5 | 6 | 1 | |
4 | 2017-01-19 | 25 | 19 | 1 | |
4 | 2017-01-02 | 10 | 2 | 1 | |
4 | 2016-12-31 | 20 | 11 | 1 | |
4 | 2016-12-20 | 20 | 1 | 1 | |
⋮ | |||||
5 | 2016-12-31 | 25 | 45 | 1 | 1 |
5 | 2016-12-11 | 5 | 25 | 1 | |
5 | 2016-12-07 | 10 | 21 | 1 | |
5 | 2016-11-17 | 15 | 1 | 1 | |
5 | 2016-11-16 | 25 | 0 | 1 | |
5 | 2016-11-16 | 20 | 1 | 1 |
Now we can judge from the table that top-up of the customer with id 4 on 2017-04-22 is done outside fee call promotion. Also, in each coloured interval is correctly only one positive interval_start_flag and all belongs_to_interval flags are positive within both intervals.
Step 3: Tagging the top-ups
Now we are prepared to tag the top-ups so that all the top-ups with the same tag belong to the same period. A tag is represented by a sequence number of a free call interval for a customer. Besides, we will also compute the end dates of the intervals.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
CREATE TABLE topups_tagged AS SELECT user_id ,date ,topup_value -- 1. TAG: sequence number of the interval calculated as sum of interval_start_flag ,SUM(interval_start_flag) OVER (PARTITION BY user_id ORDER BY date, top_up_seq ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS interval_seq -- 2. end of the period computed as qualifying top-up plus 28 days ,MAX(CASE WHEN topup_value >= 20 THEN date + 28 END) OVER (PARTITION BY user_id ORDER BY date, top_up_seq ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS interval_end FROM topups_flags -- leaving out the out-of-interval top-ups WHERE belongs_to_interval = 1 ORDER BY user_id, date DESC; |
Notice that we filtered out the out-of-interval top-ups in the WHERE clause prior to other computations.
- With the interval_seq field, we are numbering the intervals for each customer from 1. We will take advantage of a simple feature of the interval_start_flag: the sum of values in a single interval is one. Therefore, if we sum the values from the beginning, the result increments with every new interval.
- The frame size in the OVER clause is defined from the first row till the current one because we want to look only at the past and current top-ups to get the correct tag.
- The SUM function ignores null values, hence increments the result only with 1 value. This ensures that all top-ups that belong to the same interval get the same tag.
- The interval_end computation is using CASE statement inside the MAX window function, which takes the most recent date out of the qualifying top-ups’ dates, plus 28 days, as an end of the period.
- The OVER clause is defined again from the first row till the current one.
- The CASE clause returns the current date plus 28 if the current top-up is qualifying. Importantly, when a top-up is not qualifying, then the CASE clause produces null value, which will lead the MAX function to ignore all non-qualifying top-ups.
- The result is the expected end of the interval under the condition that no other qualifying top-up pops up in the current interval at the future rows. Thus, only the date at the last top-up of the interval will be important for us.
Output:
user_id | date | topup_value | interval_seq | interval_end |
---|---|---|---|---|
⋮ | ||||
4 | 2017-06-17 | 25 | 2 | 2017-07-15 |
4 | 2017-06-07 | 10 | 2 | 2017-06-27 |
4 | 2017-06-05 | 5 | 2 | 2017-06-27 |
4 | 2017-05-30 | 25 | 2 | 2017-06-27 |
4 | 2017-04-10 | 5 | 1 | 2017-04-12 |
4 | 2017-03-15 | 25 | 1 | 2017-04-12 |
4 | 2017-03-09 | 5 | 1 | 2017-03-15 |
4 | 2017-02-25 | 15 | 1 | 2017-03-15 |
4 | 2017-02-15 | 20 | 1 | 2017-03-15 |
4 | 2017-01-25 | 5 | 1 | 2017-02-16 |
4 | 2017-01-19 | 25 | 1 | 2017-02-16 |
4 | 2017-01-02 | 10 | 1 | 2017-01-28 |
4 | 2016-12-31 | 20 | 1 | 2017-01-28 |
4 | 2016-12-20 | 20 | 1 | 2017-01-17 |
⋮ | ||||
5 | 2016-12-31 | 25 | 2 | 2017-01-28 |
5 | 2016-12-11 | 5 | 1 | 2016-12-14 |
5 | 2016-12-07 | 10 | 1 | 2016-12-14 |
5 | 2016-11-17 | 15 | 1 | 2016-12-14 |
5 | 2016-11-16 | 20 | 1 | 2016-12-14 |
5 | 2016-11-16 | 25 | 1 | 2016-12-14 |
We can see here that all rows which are supposed to be merged into a single interval (same colour, same seq value), expectedly contain the same value in the interval_seq column.
Step 4: Grouping top-ups into intervals
The last step is to group the top-ups by the customer and tag (interval_seq field). We will extract a start of an interval as the earliest top-up date in the interval. The end of the interval is calculated as the latest date from the column interval_end. The reason is that the cumulative character of the window functions have not allowed us to discover the end of the interval at the first row and we were updating it until the real end of the interval. The last column tells us the desired sum of all top-up values over whole free call interval.
1 2 3 4 5 6 7 8 9 |
SELECT user_id ,interval_seq ,MIN(date) AS interval_start ,MAX(interval_end) AS interval_end ,SUM(topup_value) AS topup_sum FROM topups_tagged GROUP BY user_id, interval_seq ORDER BY user_id, interval_start DESC; |
Output:
user_id | interval_seq | interval_start | interval_end | topup_sum |
---|---|---|---|---|
1 | 3 | 2017-03-20 | 2017-05-15 | 40 |
1 | 2 | 2017-01-03 | 2017-02-25 | 65 |
1 | 1 | 2016-11-07 | 2017-01-02 | 55 |
2 | 1 | 2017-01-15 | 2017-02-12 | 40 |
3 | 4 | 2017-05-11 | 2017-06-08 | 30 |
3 | 3 | 2017-03-15 | 2017-05-03 | 45 |
3 | 2 | 2017-02-01 | 2017-03-01 | 30 |
3 | 1 | 2016-12-19 | 2017-01-16 | 25 |
4 | 2 | 2017-05-30 | 2017-07-15 | 65 |
4 | 1 | 2016-12-20 | 2017-04-12 | 150 |
5 | 4 | 2017-05-31 | 2017-07-13 | 65 |
5 | 3 | 2017-04-29 | 2017-05-27 | 30 |
5 | 2 | 2016-12-31 | 2017-02-23 | 60 |
5 | 1 | 2016-11-16 | 2016-12-14 | 75 |
As we can see, the sum of all top-ups in the first interval of the customer 5 is 75, which truly follows 5+10+15+25+20. You can check that the starts and ends of all intervals agree with the final intervals of the first approach.
All in one single query
Finally, let’s show the single query similarly as in the previous approach:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 |
WITH topups_enriched AS ( SELECT topups.* ,date - MAX(CASE WHEN topup_value >= 20 THEN date END) OVER ( PARTITION BY user_id ORDER BY date, top_up_seq ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING ) AS days_since_prev_qualifying_topup FROM topups ), topups_tagged AS ( SELECT user_id ,date ,topup_value -- TAG: sequence number of the interval calculated as sum of interval_start_flag ,SUM(CASE WHEN COALESCE(days_since_prev_qualifying_topup, 30) > 28 AND topup_value>=20 THEN 1 END) OVER (PARTITION BY user_id ORDER BY date, top_up_seq ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS interval_seq -- end of the period computed as qualifying top-up plus 28 days ,MAX(CASE WHEN topup_value >= 20 THEN date + 28 END) OVER (PARTITION BY user_id ORDER BY date, top_up_seq ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS interval_end FROM topups_enriched -- leaving out the out-of-interval top-ups WHERE NOT(COALESCE(days_since_prev_qualifying_topup, 30) > 28) OR topup_value >= 20 ) SELECT user_id ,interval_seq ,MIN(date) AS interval_start ,MAX(interval_end) AS interval_end ,SUM(topup_value) AS topup_sum FROM topups_tagged GROUP BY user_id, interval_seq ORDER BY user_id, interval_start DESC; |
In the previous approach, we left out the subquery that did not use any window functions (extra_flags). We apply the same rule in this case and omit the table topups_flags, which created flags only to illustrate the idea.