SQL,

Redshift's Window Functions Advanced use case – Sessionization

Uli Bethke

Uli has been rocking the data world since 2001. As the Co-founder of Sonra, the data liberation company, he’s on a mission to set data free. Uli doesn’t just talk the talk—he writes the books, leads the communities, and takes the stage as a conference speaker.

Any questions or comments for Uli? Connect with him on LinkedIn.


Published on August 14, 2017
Updated on November 20, 2024


In the last post about the Window Functions, we introduced an advanced use case, in which the window functions help to make the query more readable, simple and efficient. The problem was to find free call intervals for each customer, which are created as customers tops-up their credit for at least €20 and get free calls for the next 28 days. The solution lay in computing the number of days since the previous/next top up including only top-ups of €20 or greater, indentifying first and last top ups of intervals and filtering out the inner top ups.
The approach presented in the previous post has some advantages, for example, the logic can be easily inverted to extract the information about the periods without the free credit. It has, however, a hard to miss downside – it does not allow us to aggregate all the top-ups, ie. what if we want to include all those €5 or €10 top-ups that might have also happened during promotional period into aggregates? This requires an alternative approach in which all the events are tagged respectively instead of filtered out. We will introduce such an approach in this post.
Finding the free call intervals can be compared to the process of a sessionization. You can understand the sessionization as finding sessions in time series by grouping the data into sessions. A simple approach is to apply fixed time interval on the periods of inactivity and if the inactivity lasts longer than the interval, cut the session and start a new one. It could be 30 minutes in the context of clickstreams of some web application, which might then use the individual sessions to make some statistics about the users’ behaviour. In the context of our use case, the interval is 28 days and we understand the free call interval as a session. The task is little more complicated since only some of the top ups determine the sessions.
Again, let’s break the query into simple, functional steps. The plan is as follows:

  • Firstly, we calculate the number of days since the previous €20+ top-up for each top-up.
  • Secondly, we use this information to mark the first top-up of each interval. Moreover, we determine the top-ups that belong to some interval to filter out the others.
  • Thirdly, we assign a tag to each top-up, so that all top-ups in the same interval have the same customer id and tag. Besides, we calculate the end dates of the intervals.
  • Finally, we group over the customer id and tag to produce arbitrary aggregation result over all top-ups in the interval. Lastly, we extract the starting and ending dates of intervals.

To demonstrate the use case of the top-ups aggregation, we print also a sum of all credits that a customer topped-up in each interval.

Step 1: Days since the previous qualifying top-up

First, we compute the number of days from the previous €20+ top-up. It’s similar to what we did before but this time we will include but ignore the top-ups with less than €20 value. Therefore, we can interpret the value as the number of days since the nearest previous top-up with more than €20.

CREATE TABLE topups_enriched AS
SELECT
      topups.*
     ,date - MAX(CASE WHEN topup_value >= 20 THEN date END) OVER (
            PARTITION BY user_id
            ORDER BY date, top_up_seq ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
            ) AS days_since_prev_qualifying_topup
FROM topups
ORDER BY user_id, date DESC, top_up_seq DESC;
  • The OVER clause creates a frame from the first row (still) till the previous one (moving), so it’s ordered by date. That means from the oldest top-up till the previous. The top_up_seq field ensures that the output will be deterministic even for multiple top-ups per day.
  • The CASE clause returns the date if the current top-up has a value of 20 or more. Otherwise, it returns null.
  • We search the MAX value in the frame because out of all dates marking the high top-ups, we want to pick relatively the most recent one. Of course, the function skips null values, hence we truly get the date of the last high top-up.

Output (we leave out topup_seq):

user_id date topup_value days_since_prev_qualifying_topup
4 2017-06-17 25 18
4 2017-06-07 10 8
4 2017-06-05 5 6
4 2017-05-30 25 76
4 2017-04-22 10 38
4 2017-04-10 5 26
4 2017-03-15 25 28
4 2017-03-09 5 22
4 2017-02-25 15 10
4 2017-02-15 20 27
4 2017-01-25 5 6
4 2017-01-19 25 19
4 2017-01-02 10 2
4 2016-12-31 20 11
4 2016-12-20 20  
5 2016-12-31 25 45
5 2016-12-11 5 25
5 2016-12-07 10 21
5 2016-11-17 15 1
5 2016-11-16 25 0
5 2016-11-16 20  

We can check that the value for the customer with id 5 on 2016-12-07 is 21 (days from the high top-up 2016-11-16), and it is 25 on 2016-12-11 (again from 2016-11-16). Let’s track again the same free call intervals, coloured with the green and blue colour. As opposed to the previous approach, the intervals now also contain the top-ups with a value less than €20. As an example, see the top-up on 2017-04-10 for the customer with id 4: it also falls into the free call interval because it is within the 28 days since the last €20+ top-up of the interval (which is 2017-03-15). Therefore, these top-ups belong to the interval too and we will use them to compute the sum of all top-ups in an interval.

Step 2: Mark top-ups that belong to a free call interval

Based on the value from the previous step, we can determine which top-ups belong to some free call interval and which does not occur in any. The condition follows: if the top-up has value more than or equal to €20 or is no more than 28 days later than the previous qualifying top-up, then it belongs to some free call interval.
Besides, we will also mark the interval starting top-ups, which is going to be useful for the period sequence calculation. We know which top-up starts the interval because the value must be more than or equal to €20 and the previous qualifying top-up must be more than 28 days ago.
This gives two extra flags we will include in calculations:

CREATE TABLE topups_flags AS
SELECT
      topups_enriched.*
      -- belonging to some free call interval flag
     ,CASE WHEN NOT(COALESCE(days_since_prev_qualifying_topup, 30) > 28)
            OR topup_value >= 20 THEN 1
      END AS belongs_to_interval
      -- first top-up of an interval flag
     ,CASE WHEN COALESCE(days_since_prev_qualifying_topup, 30) > 28
            AND topup_value >= 20 THEN 1
      END AS interval_start_flag
FROM topups_enriched
ORDER BY user_id, date DESC, top_up_seq DESC;
user_id date topup_value days_since_prev_qualifying_topup belongs_to_interval interval_start_flag
4 2017-06-17 25 18 1  
4 2017-06-07 10 8 1  
4 2017-06-05 5 6 1  
4 2017-05-30 25 76 1 1
4 2017-04-22 10 38   
4 2017-04-10 5 26 1  
4 2017-03-15 25 28 1  
4 2017-03-09 5 22 1  
4 2017-02-25 15 10 1  
4 2017-02-15 20 27 1  
4 2017-01-25 5 6 1  
4 2017-01-19 25 19 1  
4 2017-01-02 10 2 1  
4 2016-12-31 20 11 1  
4 2016-12-20 20   1 1
5 2016-12-31 25 45 1 1
5 2016-12-11 5 25 1  
5 2016-12-07 10 21 1  
5 2016-11-17 15 1 1  
5 2016-11-16 25 0 1  
5 2016-11-16 20   1 1

Now we can judge from the table that top-up of the customer with id 4 on 2017-04-22 is done outside fee call promotion. Also, in each coloured interval is correctly only one positive interval_start_flag and all belongs_to_interval flags are positive within both intervals.

Step 3: Tagging the top-ups

Now we are prepared to tag the top-ups so that all the top-ups with the same tag belong to the same period. A tag is represented by a sequence number of a free call interval for a customer. Besides, we will also compute the end dates of the intervals.

CREATE TABLE topups_tagged AS
SELECT
      user_id
     ,date
     ,topup_value
      -- 1. TAG: sequence number of the interval calculated as sum of interval_start_flag
     ,SUM(interval_start_flag) OVER (PARTITION BY user_id ORDER BY date, top_up_seq
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS interval_seq
      -- 2. end of the period computed as qualifying top-up plus 28 days
     ,MAX(CASE WHEN topup_value >= 20 THEN date + 28 END)
            OVER (PARTITION BY user_id ORDER BY date, top_up_seq
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS interval_end
FROM topups_flags
-- leaving out the out-of-interval top-ups
WHERE belongs_to_interval = 1
ORDER BY user_id, date DESC;

Notice that we filtered out the out-of-interval top-ups in the WHERE clause prior to other computations.

  1. With the interval_seq field, we are numbering the intervals for each customer from 1. We will take advantage of a simple feature of the interval_start_flag: the sum of values in a single interval is one. Therefore, if we sum the values from the beginning, the result increments with every new interval.
  2. The interval_end computation is using CASE statement inside the MAX window function, which takes the most recent date out of the qualifying top-ups’ dates, plus 28 days, as an end of the period.

Output:

user_id date topup_value interval_seq interval_end
4 2017-06-17 25 2 2017-07-15
4 2017-06-07 10 2 2017-06-27
4 2017-06-05 5 2 2017-06-27
4 2017-05-30 25 2 2017-06-27
4 2017-04-10 5 1 2017-04-12
4 2017-03-15 25 1 2017-04-12
4 2017-03-09 5 1 2017-03-15
4 2017-02-25 15 1 2017-03-15
4 2017-02-15 20 1 2017-03-15
4 2017-01-25 5 1 2017-02-16
4 2017-01-19 25 1 2017-02-16
4 2017-01-02 10 1 2017-01-28
4 2016-12-31 20 1 2017-01-28
4 2016-12-20 20 1 2017-01-17
5 2016-12-31 25 2 2017-01-28
5 2016-12-11 5 1 2016-12-14
5 2016-12-07 10 1 2016-12-14
5 2016-11-17 15 1 2016-12-14
5 2016-11-16 20 1 2016-12-14
5 2016-11-16 25 1 2016-12-14

We can see here that all rows which are supposed to be merged into a single interval (same colour, same seq value), expectedly contain the same value in the interval_seq column.

Step 4: Grouping top-ups into intervals

The last step is to group the top-ups by the customer and tag (interval_seq field). We will extract a start of an interval as the earliest top-up date in the interval. The end of the interval is calculated as the latest date from the column interval_end. The reason is that the cumulative character of the window functions have not allowed us to discover the end of the interval at the first row and we were updating it until the real end of the interval. The last column tells us the desired sum of all top-up values over whole free call interval.

SELECT
      user_id
     ,interval_seq
     ,MIN(date) AS interval_start
     ,MAX(interval_end) AS interval_end
     ,SUM(topup_value) AS topup_sum
FROM topups_tagged
GROUP BY user_id, interval_seq
ORDER BY user_id, interval_start DESC;
Solve my Redshift Problems

Output:

user_id interval_seq interval_start interval_end topup_sum
1 3 2017-03-20 2017-05-15 40
1 2 2017-01-03 2017-02-25 65
1 1 2016-11-07 2017-01-02 55
2 1 2017-01-15 2017-02-12 40
3 4 2017-05-11 2017-06-08 30
3 3 2017-03-15 2017-05-03 45
3 2 2017-02-01 2017-03-01 30
3 1 2016-12-19 2017-01-16 25
4 2 2017-05-30 2017-07-15 65
4 1 2016-12-20 2017-04-12 150
5 4 2017-05-31 2017-07-13 65
5 3 2017-04-29 2017-05-27 30
5 2 2016-12-31 2017-02-23 60
5 1 2016-11-16 2016-12-14 75

As we can see, the sum of all top-ups in the first interval of the customer 5 is 75, which truly follows 5+10+15+25+20. You can check that the starts and ends of all intervals agree with the final intervals of the first approach.

All in one single query

Finally, let’s show the single query similarly as in the previous approach:

WITH topups_enriched AS (
  SELECT
        topups.*
       ,date - MAX(CASE WHEN topup_value >= 20 THEN date END) OVER (
              PARTITION BY user_id
              ORDER BY date, top_up_seq ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
              ) AS days_since_prev_qualifying_topup
  FROM topups
), topups_tagged AS (
  SELECT
        user_id
       ,date
       ,topup_value
        -- TAG: sequence number of the interval calculated as sum of interval_start_flag
       ,SUM(CASE WHEN COALESCE(days_since_prev_qualifying_topup, 30) > 28 AND
              topup_value>=20 THEN 1 END)
              OVER (PARTITION BY user_id ORDER BY date, top_up_seq
              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS interval_seq
        -- end of the period computed as qualifying top-up plus 28 days
       ,MAX(CASE WHEN topup_value >= 20 THEN date + 28 END)
              OVER (PARTITION BY user_id ORDER BY date, top_up_seq
              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS interval_end
  FROM topups_enriched
  -- leaving out the out-of-interval top-ups
  WHERE NOT(COALESCE(days_since_prev_qualifying_topup, 30) > 28) OR topup_value >= 20
) SELECT
      user_id
     ,interval_seq
     ,MIN(date) AS interval_start
     ,MAX(interval_end) AS interval_end
     ,SUM(topup_value) AS topup_sum
FROM topups_tagged
GROUP BY user_id, interval_seq
ORDER BY user_id, interval_start DESC;

In the previous approach, we left out the subquery that did not use any window functions (extra_flags). We apply the same rule in this case and omit the table topups_flags, which created flags only to illustrate the idea. 

Uli Bethke

About the author:

Uli Bethke

Co-founder of Sonra

Uli has been rocking the data world since 2001. As the Co-founder of Sonra, the data liberation company, he’s on a mission to set data free. Uli doesn’t just talk the talk—he writes the books, leads the communities, and takes the stage as a conference speaker.

Any questions or comments for Uli? Connect with him on LinkedIn.