Redshift's Window Functions Advanced use case - Sessionization

Jiří Mauritz Data Warehouse, Redshift, Window Functions

In the last post about the Window Functions, we introduced an advanced use case, in which the window functions help to make the query more readable, simple and efficient. The problem was to find free call intervals for each customer, which are created as customers tops-up their credit for at least €20 and get free calls for the next 28 days. The solution lay in computing the number of days since the previous/next top up including only top-ups of €20 or greater, indentifying first and last top ups of intervals and filtering out the inner top ups.

The approach presented in the previous post has some advantages, for example, the logic can be easily inverted to extract the information about the periods without the free credit. It has, however, a hard to miss downside - it does not allow us to aggregate all the top-ups, ie. what if we want to include all those €5 or €10 top-ups that might have also happened during promotional period into aggregates? This requires an alternative approach in which all the events are tagged respectively instead of filtered out. We will introduce such an approach in this post.

Finding the free call intervals can be compared to the process of a sessionization. You can understand the sessionization as finding sessions in time series by grouping the data into sessions. A simple approach is to apply fixed time interval on the periods of inactivity and if the inactivity lasts longer than the interval, cut the session and start a new one. It could be 30 minutes in the context of clickstreams of some web application, which might then use the individual sessions to make some statistics about the users’ behaviour. In the context of our use case, the interval is 28 days and we understand the free call interval as a session. The task is little more complicated since only some of the top ups determine the sessions.

Again, let’s break the query into simple, functional steps. The plan is as follows:

  • Firstly, we calculate the number of days since the previous €20+ top-up for each top-up.
  • Secondly, we use this information to mark the first top-up of each interval. Moreover, we determine the top-ups that belong to some interval to filter out the others.
  • Thirdly, we assign a tag to each top-up, so that all top-ups in the same interval have the same customer id and tag. Besides, we calculate the end dates of the intervals.
  • Finally, we group over the customer id and tag to produce arbitrary aggregation result over all top-ups in the interval. Lastly, we extract the starting and ending dates of intervals.

To demonstrate the use case of the top-ups aggregation, we print also a sum of all credits that a customer topped-up in each interval.

Step 1: Days since the previous qualifying top-up

First, we compute the number of days from the previous €20+ top-up. It's similar to what we did before but this time we will include but ignore the top-ups with less than €20 value. Therefore, we can interpret the value as the number of days since the nearest previous top-up with more than €20.

 

  • The OVER clause creates a frame from the first row (still) till the previous one (moving), so it's ordered by date. That means from the oldest top-up till the previous. The top_up_seq field ensures that the output will be deterministic even for multiple top-ups per day.
  • The CASE clause returns the date if the current top-up has a value of 20 or more. Otherwise, it returns null.
  • We search the MAX value in the frame because out of all dates marking the high top-ups, we want to pick relatively the most recent one. Of course, the function skips null values, hence we truly get the date of the last high top-up.

Output (we leave out topup_seq):

user_id

date

topup_value

days_since_prev_qualifying_topup

4

2017-06-17

25

18

4

2017-06-07

10

8

4

2017-06-05

5

6

4

2017-05-30

25

76

4

2017-04-22

10

38

4

2017-04-10

5

26

4

2017-03-15

25

28

4

2017-03-09

5

22

4

2017-02-25

15

10

4

2017-02-15

20

27

4

2017-01-25

5

6

4

2017-01-19

25

19

4

2017-01-02

10

2

4

2016-12-31

20

11

4

2016-12-20

20

5

2016-12-31

25

45

5

2016-12-11

5

25

5

2016-12-07

10

21

5

2016-11-17

15

1

5

2016-11-16

25

0

5

2016-11-16

20

We can check that the value for the customer with id 5 on 2016-12-07 is 21 (days from the high top-up 2016-11-16), and it is 25 on 2016-12-11 (again from 2016-11-16). Let’s track again the same free call intervals, coloured with the green and blue colour. As opposed to the previous approach, the intervals now also contain the top-ups with a value less than €20. As an example, see the top-up on 2017-04-10 for the customer with id 4: it also falls into the free call interval because it is within the 28 days since the last €20+ top-up of the interval (which is 2017-03-15). Therefore, these top-ups belong to the interval too and we will use them to compute the sum of all top-ups in an interval.

Step 2: Mark top-ups that belong to a free call interval

Based on the value from the previous step, we can determine which top-ups belong to some free call interval and which does not occur in any. The condition follows: if the top-up has value more than or equal to €20 or is no more than 28 days later than the previous qualifying top-up, then it belongs to some free call interval.

Besides, we will also mark the interval starting top-ups, which is going to be useful for the period sequence calculation. We know which top-up starts the interval because the value must be more than or equal to €20 and the previous qualifying top-up must be more than 28 days ago.

This gives two extra flags we will include in calculations:

 

user_id

date

topup_value

days_since_prev_qualifying_topup

belongs_to_interval

interval_start_flag

4

2017-06-17

25

18

1

4

2017-06-07

10

8

1

4

2017-06-05

5

6

1

4

2017-05-30

25

76

1

1

4

2017-04-22

10

38

4

2017-04-10

5

26

1

4

2017-03-15

25

28

1

4

2017-03-09

5

22

1

4

2017-02-25

15

10

1

4

2017-02-15

20

27

1

4

2017-01-25

5

6

1

4

2017-01-19

25

19

1

4

2017-01-02

10

2

1

4

2016-12-31

20

11

1

4

2016-12-20

20

1

1

5

2016-12-31

25

45

1

1

5

2016-12-11

5

25

1

5

2016-12-07

10

21

1

5

2016-11-17

15

1

1

5

2016-11-16

25

0

1

5

2016-11-16

20

1

1

Now we can judge from the table that top-up of the customer with id 4 on 2017-04-22 is done outside fee call promotion. Also, in each coloured interval is correctly only one positive interval_start_flag and all belongs_to_interval flags are positive within both intervals.

Step 3: Tagging the top-ups

Now we are prepared to tag the top-ups so that all the top-ups with the same tag belong to the same period. A tag is represented by a sequence number of a free call interval for a customer. Besides, we will also compute the end dates of the intervals.

 

Notice that we filtered out the out-of-interval top-ups in the WHERE clause prior to other computations.

  1. With the interval_seq field, we are numbering the intervals for each customer from 1. We will take advantage of a simple feature of the interval_start_flag: the sum of values in a single interval is one. Therefore, if we sum the values from the beginning, the result increments with every new interval.
    • The frame size in the OVER clause is defined from the first row till the current one because we want to look only at the past and current top-ups to get the correct tag.
    • The SUM function ignores null values, hence increments the result only with 1 value. This ensures that all top-ups that belong to the same interval get the same tag.
  2. The interval_end computation is using CASE statement inside the MAX window function, which takes the most recent date out of the qualifying top-ups’ dates, plus 28 days, as an end of the period.
    • The OVER clause is defined again from the first row till the current one.
    • The CASE clause returns the current date plus 28 if the current top-up is qualifying. Importantly, when a top-up is not qualifying, then the CASE clause produces null value, which will lead the MAX function to ignore all non-qualifying top-ups.
    • The result is the expected end of the interval under the condition that no other qualifying top-up pops up in the current interval at the future rows. Thus, only the date at the last top-up of the interval will be important for us.

Output:

user_id

date

topup_value

interval_seq

interval_end

4

2017-06-17

25

2

2017-07-15

4

2017-06-07

10

2

2017-06-27

4

2017-06-05

5

2

2017-06-27

4

2017-05-30

25

2

2017-06-27

4

2017-04-10

5

1

2017-04-12

4

2017-03-15

25

1

2017-04-12

4

2017-03-09

5

1

2017-03-15

4

2017-02-25

15

1

2017-03-15

4

2017-02-15

20

1

2017-03-15

4

2017-01-25

5

1

2017-02-16

4

2017-01-19

25

1

2017-02-16

4

2017-01-02

10

1

2017-01-28

4

2016-12-31

20

1

2017-01-28

4

2016-12-20

20

1

2017-01-17

5

2016-12-31

25

2

2017-01-28

5

2016-12-11

5

1

2016-12-14

5

2016-12-07

10

1

2016-12-14

5

2016-11-17

15

1

2016-12-14

5

2016-11-16

20

1

2016-12-14

5

2016-11-16

25

1

2016-12-14

We can see here that all rows which are supposed to be merged into a single interval (same colour, same seq value), expectedly contain the same value in the interval_seq column.

Step 4: Grouping top-ups into intervals

The last step is to group the top-ups by the customer and tag (interval_seq field). We will extract a start of an interval as the earliest top-up date in the interval. The end of the interval is calculated as the latest date from the column interval_end. The reason is that the cumulative character of the window functions have not allowed us to discover the end of the interval at the first row and we were updating it until the real end of the interval. The last column tells us the desired sum of all top-up values over whole free call interval.

 

Output:

user_id

interval_seq

interval_start

interval_end

topup_sum

1

3

2017-03-20

2017-05-15

40

1

2

2017-01-03

2017-02-25

65

1

1

2016-11-07

2017-01-02

55

2

1

2017-01-15

2017-02-12

40

3

4

2017-05-11

2017-06-08

30

3

3

2017-03-15

2017-05-03

45

3

2

2017-02-01

2017-03-01

30

3

1

2016-12-19

2017-01-16

25

4

2

2017-05-30

2017-07-15

65

4

1

2016-12-20

2017-04-12

150

5

4

2017-05-31

2017-07-13

65

5

3

2017-04-29

2017-05-27

30

5

2

2016-12-31

2017-02-23

60

5

1

2016-11-16

2016-12-14

75

As we can see, the sum of all top-ups in the first interval of the customer 5 is 75, which truly follows 5+10+15+25+20. You can check that the starts and ends of all intervals agree with the final intervals of the first approach.

All in one single query

Finally, let’s show the single query similarly as in the previous approach:

In the previous approach, we left out the subquery that did not use any window functions (extra_flags). We apply the same rule in this case and omit the table topups_flags, which created flags only to illustrate the idea.