SQL,

Home » SQL » Redshift's Window Functions Advanced use case – Sessionization

Redshift's Window Functions Advanced use case – Sessionization

by Uli Bethke

Uli has been rocking the data world since 2001. As the Co-founder of Sonra, the data liberation company, he’s on a mission to set data free. Uli doesn’t just talk the talk—he writes the books, leads the communities, and takes the stage as a conference speaker.

Any questions or comments for Uli? Connect with him on LinkedIn.

Published on August 14, 2017
Updated on November 20, 2024

In the last post about the Window Functions, we introduced an advanced use case, in which the window functions help to make the query more readable, simple and efficient. The problem was to find free call intervals for each customer, which are created as customers tops-up their credit for at least €20 and get free calls for the next 28 days. The solution lay in computing the number of days since the previous/next top up including only top-ups of €20 or greater, indentifying first and last top ups of intervals and filtering out the inner top ups.
The approach presented in the previous post has some advantages, for example, the logic can be easily inverted to extract the information about the periods without the free credit. It has, however, a hard to miss downside – it does not allow us to aggregate all the top-ups, ie. what if we want to include all those €5 or €10 top-ups that might have also happened during promotional period into aggregates? This requires an alternative approach in which all the events are tagged respectively instead of filtered out. We will introduce such an approach in this post.
Finding the free call intervals can be compared to the process of a sessionization. You can understand the sessionization as finding sessions in time series by grouping the data into sessions. A simple approach is to apply fixed time interval on the periods of inactivity and if the inactivity lasts longer than the interval, cut the session and start a new one. It could be 30 minutes in the context of clickstreams of some web application, which might then use the individual sessions to make some statistics about the users’ behaviour. In the context of our use case, the interval is 28 days and we understand the free call interval as a session. The task is little more complicated since only some of the top ups determine the sessions.
Again, let’s break the query into simple, functional steps. The plan is as follows:

Firstly, we calculate the number of days since the previous €20+ top-up for each top-up.
Secondly, we use this information to mark the first top-up of each interval. Moreover, we determine the top-ups that belong to some interval to filter out the others.
Thirdly, we assign a tag to each top-up, so that all top-ups in the same interval have the same customer id and tag. Besides, we calculate the end dates of the intervals.
Finally, we group over the customer id and tag to produce arbitrary aggregation result over all top-ups in the interval. Lastly, we extract the starting and ending dates of intervals.

To demonstrate the use case of the top-ups aggregation, we print also a sum of all credits that a customer topped-up in each interval.

Step 1: Days since the previous qualifying top-up

First, we compute the number of days from the previous €20+ top-up. It’s similar to what we did before but this time we will include but ignore the top-ups with less than €20 value. Therefore, we can interpret the value as the number of days since the nearest previous top-up with more than €20.

CREATE TABLE topups_enriched AS

SELECT

topups.*

,date - MAX(CASE WHEN topup_value >= 20 THEN date END) OVER (

PARTITION BY user_id

ORDER BY date, top_up_seq ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING

) AS days_since_prev_qualifying_topup

FROM topups

ORDER BY user_id, date DESC, top_up_seq DESC;

The OVER clause creates a frame from the first row (still) till the previous one (moving), so it’s ordered by date. That means from the oldest top-up till the previous. The top_up_seq field ensures that the output will be deterministic even for multiple top-ups per day.
The CASE clause returns the date if the current top-up has a value of 20 or more. Otherwise, it returns null.
We search the MAX value in the frame because out of all dates marking the high top-ups, we want to pick relatively the most recent one. Of course, the function skips null values, hence we truly get the date of the last high top-up.

Output (we leave out topup_seq):

user_id	date	topup_value	days_since_prev_qualifying_topup
⋮
4	2017-06-17	25	18
4	2017-06-07	10	8
4	2017-06-05	5	6
4	2017-05-30	25	76
4	2017-04-22	10	38
4	2017-04-10	5	26
4	2017-03-15	25	28
4	2017-03-09	5	22
4	2017-02-25	15	10
4	2017-02-15	20	27
4	2017-01-25	5	6
4	2017-01-19	25	19
4	2017-01-02	10	2
4	2016-12-31	20	11
4	2016-12-20	20
⋮
5	2016-12-31	25	45
5	2016-12-11	5	25
5	2016-12-07	10	21
5	2016-11-17	15	1
5	2016-11-16	25	0
5	2016-11-16	20

We can check that the value for the customer with id 5 on 2016-12-07 is 21 (days from the high top-up 2016-11-16), and it is 25 on 2016-12-11 (again from 2016-11-16). Let’s track again the same free call intervals, coloured with the green and blue colour. As opposed to the previous approach, the intervals now also contain the top-ups with a value less than €20. As an example, see the top-up on 2017-04-10 for the customer with id 4: it also falls into the free call interval because it is within the 28 days since the last €20+ top-up of the interval (which is 2017-03-15). Therefore, these top-ups belong to the interval too and we will use them to compute the sum of all top-ups in an interval.

Step 2: Mark top-ups that belong to a free call interval

Based on the value from the previous step, we can determine which top-ups belong to some free call interval and which does not occur in any. The condition follows: if the top-up has value more than or equal to €20 or is no more than 28 days later than the previous qualifying top-up, then it belongs to some free call interval.
Besides, we will also mark the interval starting top-ups, which is going to be useful for the period sequence calculation. We know which top-up starts the interval because the value must be more than or equal to €20 and the previous qualifying top-up must be more than 28 days ago.
This gives two extra flags we will include in calculations:

CREATE TABLE topups_flags AS

SELECT

topups_enriched.*

-- belonging to some free call interval flag

,CASE WHEN NOT(COALESCE(days_since_prev_qualifying_topup, 30) > 28)

OR topup_value >= 20 THEN 1

END AS belongs_to_interval

-- first top-up of an interval flag

,CASE WHEN COALESCE(days_since_prev_qualifying_topup, 30) > 28

AND topup_value >= 20 THEN 1

END AS interval_start_flag

FROM topups_enriched

ORDER BY user_id, date DESC, top_up_seq DESC;

user_id	date	topup_value	days_since_prev_qualifying_topup	belongs_to_interval	interval_start_flag
4	2017-06-17	25	18	1
4	2017-06-07	10	8	1
4	2017-06-05	5	6	1
4	2017-05-30	25	76	1	1
4	2017-04-22	10	38
4	2017-04-10	5	26	1
4	2017-03-15	25	28	1
4	2017-03-09	5	22	1
4	2017-02-25	15	10	1
4	2017-02-15	20	27	1
4	2017-01-25	5	6	1
4	2017-01-19	25	19	1
4	2017-01-02	10	2	1
4	2016-12-31	20	11	1
4	2016-12-20	20		1	1
⋮
5	2016-12-31	25	45	1	1
5	2016-12-11	5	25	1
5	2016-12-07	10	21	1
5	2016-11-17	15	1	1
5	2016-11-16	25	0	1
5	2016-11-16	20		1	1

Now we can judge from the table that top-up of the customer with id 4 on 2017-04-22 is done outside fee call promotion. Also, in each coloured interval is correctly only one positive interval_start_flag and all belongs_to_interval flags are positive within both intervals.

Step 3: Tagging the top-ups

Now we are prepared to tag the top-ups so that all the top-ups with the same tag belong to the same period. A tag is represented by a sequence number of a free call interval for a customer. Besides, we will also compute the end dates of the intervals.

CREATE TABLE topups_tagged AS

SELECT

user_id

,date

,topup_value

-- 1. TAG: sequence number of the interval calculated as sum of interval_start_flag

,SUM(interval_start_flag) OVER (PARTITION BY user_id ORDER BY date, top_up_seq

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS interval_seq

-- 2. end of the period computed as qualifying top-up plus 28 days

,MAX(CASE WHEN topup_value >= 20 THEN date + 28 END)

OVER (PARTITION BY user_id ORDER BY date, top_up_seq

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS interval_end

FROM topups_flags

-- leaving out the out-of-interval top-ups

WHERE belongs_to_interval = 1

ORDER BY user_id, date DESC;

Notice that we filtered out the out-of-interval top-ups in the WHERE clause prior to other computations.

With the interval_seq field, we are numbering the intervals for each customer from 1. We will take advantage of a simple feature of the interval_start_flag: the sum of values in a single interval is one. Therefore, if we sum the values from the beginning, the result increments with every new interval.
- The frame size in the OVER clause is defined from the first row till the current one because we want to look only at the past and current top-ups to get the correct tag.
- The SUM function ignores null values, hence increments the result only with 1 value. This ensures that all top-ups that belong to the same interval get the same tag.
The interval_end computation is using CASE statement inside the MAX window function, which takes the most recent date out of the qualifying top-ups’ dates, plus 28 days, as an end of the period.
- The OVER clause is defined again from the first row till the current one.
- The CASE clause returns the current date plus 28 if the current top-up is qualifying. Importantly, when a top-up is not qualifying, then the CASE clause produces null value, which will lead the MAX function to ignore all non-qualifying top-ups.
- The result is the expected end of the interval under the condition that no other qualifying top-up pops up in the current interval at the future rows. Thus, only the date at the last top-up of the interval will be important for us.

Output:

user_id	date	topup_value	interval_seq	interval_end
⋮
4	2017-06-17	25	2	2017-07-15
4	2017-06-07	10	2	2017-06-27
4	2017-06-05	5	2	2017-06-27
4	2017-05-30	25	2	2017-06-27
4	2017-04-10	5	1	2017-04-12
4	2017-03-15	25	1	2017-04-12
4	2017-03-09	5	1	2017-03-15
4	2017-02-25	15	1	2017-03-15
4	2017-02-15	20	1	2017-03-15
4	2017-01-25	5	1	2017-02-16
4	2017-01-19	25	1	2017-02-16
4	2017-01-02	10	1	2017-01-28
4	2016-12-31	20	1	2017-01-28
4	2016-12-20	20	1	2017-01-17
⋮
5	2016-12-31	25	2	2017-01-28
5	2016-12-11	5	1	2016-12-14
5	2016-12-07	10	1	2016-12-14
5	2016-11-17	15	1	2016-12-14
5	2016-11-16	20	1	2016-12-14
5	2016-11-16	25	1	2016-12-14

We can see here that all rows which are supposed to be merged into a single interval (same colour, same seq value), expectedly contain the same value in the interval_seq column.

Step 4: Grouping top-ups into intervals

The last step is to group the top-ups by the customer and tag (interval_seq field). We will extract a start of an interval as the earliest top-up date in the interval. The end of the interval is calculated as the latest date from the column interval_end. The reason is that the cumulative character of the window functions have not allowed us to discover the end of the interval at the first row and we were updating it until the real end of the interval. The last column tells us the desired sum of all top-up values over whole free call interval.

SELECT

user_id

,interval_seq

,MIN(date) AS interval_start

,MAX(interval_end) AS interval_end

,SUM(topup_value) AS topup_sum

FROM topups_tagged

GROUP BY user_id, interval_seq

ORDER BY user_id, interval_start DESC;

Solve my Redshift Problems

Output:

user_id	interval_seq	interval_start	interval_end	topup_sum
1	3	2017-03-20	2017-05-15	40
1	2	2017-01-03	2017-02-25	65
1	1	2016-11-07	2017-01-02	55
2	1	2017-01-15	2017-02-12	40
3	4	2017-05-11	2017-06-08	30
3	3	2017-03-15	2017-05-03	45
3	2	2017-02-01	2017-03-01	30
3	1	2016-12-19	2017-01-16	25
4	2	2017-05-30	2017-07-15	65
4	1	2016-12-20	2017-04-12	150
5	4	2017-05-31	2017-07-13	65
5	3	2017-04-29	2017-05-27	30
5	2	2016-12-31	2017-02-23	60
5	1	2016-11-16	2016-12-14	75

As we can see, the sum of all top-ups in the first interval of the customer 5 is 75, which truly follows 5+10+15+25+20. You can check that the starts and ends of all intervals agree with the final intervals of the first approach.

All in one single query

Finally, let’s show the single query similarly as in the previous approach:

WITH topups_enriched AS (

SELECT

topups.*

,date - MAX(CASE WHEN topup_value >= 20 THEN date END) OVER (

PARTITION BY user_id

ORDER BY date, top_up_seq ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING

) AS days_since_prev_qualifying_topup

FROM topups

), topups_tagged AS (

SELECT

user_id

,date

,topup_value

-- TAG: sequence number of the interval calculated as sum of interval_start_flag

,SUM(CASE WHEN COALESCE(days_since_prev_qualifying_topup, 30) > 28 AND

topup_value>=20 THEN 1 END)

OVER (PARTITION BY user_id ORDER BY date, top_up_seq

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS interval_seq

-- end of the period computed as qualifying top-up plus 28 days

,MAX(CASE WHEN topup_value >= 20 THEN date + 28 END)

OVER (PARTITION BY user_id ORDER BY date, top_up_seq

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS interval_end

FROM topups_enriched

-- leaving out the out-of-interval top-ups

WHERE NOT(COALESCE(days_since_prev_qualifying_topup, 30) > 28) OR topup_value >= 20

) SELECT

user_id

,interval_seq

,MIN(date) AS interval_start

,MAX(interval_end) AS interval_end

,SUM(topup_value) AS topup_sum

FROM topups_tagged

GROUP BY user_id, interval_seq

ORDER BY user_id, interval_start DESC;

In the previous approach, we left out the subquery that did not use any window functions (extra_flags). We apply the same rule in this case and omit the table topups_flags, which created flags only to illustrate the idea.

About the author:

Uli Bethke

Co-founder of Sonra

Any questions or comments for Uli? Connect with him on LinkedIn.

Follow Uli Bethke:

Back to Blog

Cookie	Duration	Description
__cfruid	session	Cloudflare sets this cookie to identify trusted web traffic.
cookielawinfo-checkbox-marketing	1 month	This cookie is set by the GDPR Cookie Consent plugin to store the user consent for the cookies in the category "Marketing".
cookielawinfo-checkbox-necessary	1 month	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-preferences	1 month	This cookie is set by the GDPR Cookie Consent plugin to check if the user has given consent to use cookies under the "Preferences" category.
cookielawinfo-checkbox-statistics	1 month	This cookie is set by the GDPR Cookie Consent plugin to store the user consent for the cookies in the category "Statistics".
cookielawinfo-checkbox-unclassified	1 month	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Unclassified".
CookieLawInfoConsent	1 month	Records the default button state of the corresponding category & the status of CCPA. It works only in coordination with the primary cookie.
csrftoken	1 year	This cookie is associated with Django web development platform for python. Used to help protect the website against Cross-Site Request Forgery attacks

Cookie	Duration	Description
AnalyticsSyncHistory	1 month	Linkedin set this cookie to store information about the time a sync took place with the lms_analytics cookie.
bcookie	2 years	LinkedIn sets this cookie from LinkedIn share buttons and ad tags to recognize browser ID.
bscookie	2 years	LinkedIn sets this cookie to store performed actions on the website.
lang	session	LinkedIn sets this cookie to remember a user's language setting.
li_gc	2 years	Linkedin set this cookie for storing visitor's consent regarding using cookies for non-essential purposes.
lidc	1 day	LinkedIn sets the lidc cookie to facilitate data center selection.
mgref	1 year	This cookie is set by Eventbrite to deliver content tailored to the end user's interests and improve content creation. It is also used for event-booking purposes.
mgrefby	1 year	This cookie is set by Eventbrite to deliver content tailored to the end user's interests and improve content creation. It is also used for event-booking purposes.
UserMatchHistory	1 month	LinkedIn sets this cookie for LinkedIn Ads ID syncing.

Cookie	Duration	Description
G	1 year	Cookie used to facilitate the translation into the preferred language of the visitor.
SERVERID	session	This cookie is set by Slideshare's HAProxy load balancer to assign the visitor to a specific server.

Cookie	Duration	Description
_ga	2 years	The _ga cookie, installed by Google Analytics, calculates visitor, session and campaign data and also keeps track of site usage for the site's analytics report. The cookie stores information anonymously and assigns a randomly generated number to recognize unique visitors.
_ga_7H38LVR4Z5	2 years	This cookie is installed by Google Analytics.
_gat_gtag_UA_44804396_1	1 minute	Set by Google to distinguish users.
_gat_UA-44804396-1	1 minute	A variation of the _gat cookie set by Google Analytics and Google Tag Manager to allow website owners to track visitor behaviour and measure site performance. The pattern element in the name contains the unique identity number of the account or website it relates to.
_gcl_au	3 months	Provided by Google Tag Manager to experiment advertisement efficiency of websites using their services.
_gid	1 day	Installed by Google Analytics, _gid cookie stores information on how visitors use a website, while also creating an analytics report of the website's performance. Some of the data that are collected include the number of visitors, their source, and the pages they visit anonymously.
CONSENT	2 years	YouTube sets this cookie via embedded youtube-videos and registers anonymous statistical data.
SIDCC	6 Months	The "SIDCC" cookie is used as security measure to protect users data from unauthorised access
test_cookie	15 minutes	The test_cookie is set by doubleclick.net and is used to determine if the user's browser supports cookies.
VISITOR_INFO1_LIVE	5 months 27 days	A cookie set by YouTube to measure bandwidth that determines whether the user gets the new or old player interface.
YSC	session	YSC cookie is set by Youtube and is used to track the views of embedded videos on Youtube pages.
yt-remote-connected-devices	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt-remote-device-id	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt.innertube::nextId	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.
yt.innertube::requests	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.

Cookie	Duration	Description
AN	1 month
AS	session
ebEventToTrack	1 month
eblang	1 year
SNID	2 years	This cookie is set by the Google. This cookie is used by the map which helps visitors to identify and reach the facility.
SP	session
SS	session

SQL Visualisation Guide - Query Diagrams, Lineage & ERD

SQL Visualisation Guide - Query Diagrams, Lineage & ERD

SQL Visualisation Guide - Query Diagrams, Lineage & ERD

SQL Visualisation Guide - Query Diagrams, Lineage & ERD