SQL,

Home » SQL » Redshift Window Functions Advanced Use Cases

Redshift Window Functions Advanced Use Cases

by Uli Bethke

Uli has been rocking the data world since 2001. As the Co-founder of Sonra, the data liberation company, he’s on a mission to set data free. Uli doesn’t just talk the talk—he writes the books, leads the communities, and takes the stage as a conference speaker.

Any questions or comments for Uli? Connect with him on LinkedIn.

Published on August 3, 2017
Updated on December 18, 2024

Merging time-based events into periods

Thanks to our previous posts about the window functions, Introduction to Window functions on Redshift and Data exploration with Window functions on Redshift, you should now be familiar with the most common functions that can be used in the OVER clause and how to apply them to your data. Today, we introduce more advanced use cases, which complexity would usually require computation in the application instead of SQL engine. We will show how to achieve the result in SQL using a combination of multiple window functions and other features of SQL.

Your subscription could not be saved. Please try again.

You're In! Welcome to FastForward Congratulations on successfully subscribing to the FastForward Data Engineering Newsletter! You're now part of a growing community of 15,000+ data engineers who are staying ahead in the ever-evolving world of data.

FlowForward.

All Things Data Engineering
Straight to Your Inbox!

We will work with the same dataset as in the first article. It is available here and contains information about mobile credit top-ups of five customers:

top_up_seq	user_id	date	topup_value
68	3	2017-06-28	15
67	5	2017-06-26	5
66	2	2017-06-20	5
65	1	2017-06-20	15
64	4	2017-06-17	25
63	5	2017-06-15	20
62	5	2017-06-12	5
61	4	2017-06-07	10
60	4	2017-06-05	5
59	2	2017-06-05	10
58	5	2017-06-02	10
57	5	2017-05-31	25
56	4	2017-05-30	25
55	3	2017-05-25	5
54	1	2017-05-22	10
⋮

In the scenario we go through today, the mobile operator runs promotions for pay as you go customers: if you top up for at least €20 you get free calls for the next 28 calendar days. Customers can apply the promotion multiple times within the 28 calendar days and extend the free call period. Our final goal is to compute start and end dates of free call intervals for each customer. Let’s see an example. The customer in our sample data has made three top-ups:

user_id	date	topup_value
1	2017-02-20	15
1	2017-01-28	25
1	2017-01-03	25

On 2017-01-03 the customer tops-up for €25 (above our threshold of €20). His free call interval begins on this date. He tops-up again for €25 on 2017-01-28, which is 25 days after the first top-up and therefore, still within the free call period. Hence, the free call interval extends for another 28 days and the interval end date is 2017-01-28 plus 28, which is 2017-02-25. Notice that the top-up on 2017-02-20 is below the free call threshold of €20. As a result, the free call period is not extended and ends on 2017-02-25 (28 days after the last top up above €20).
The free call interval starts on 2017-01-03 and ends on 2017-02-25.

user_id	interval_start
1	2017-01-03	2017-02-25

To better understand the process of merging periods into intervals, see the picture below. The first three top-ups are chronologically close enough to be merged into an interval, whereas the fourth top-up is not and starts a new interval. Let’s assume all the top-ups are of value €20 or more.

What makes this use case a challenge is the fact that for any number of intersecting periods (applying the top-up multiple times), we want to produce just one interval.
We will present the query in four steps, where the output of each step will be input for the next step. Outputs will be saved as SQL tables for clarity.

Initially, we include only top-ups of €20 or greater. For each top-up, we compute the number of days between the current top up and the previous/next top up.
Next, we will use these values to flag those rows with the start and end dates for the free call interval.
In the third step, we filter out those rows with top-ups that fall inside the free call period, which gives us two rows. One for the start and the other for the end of the free call interval. case top-ups. We also compute ending dates of intervals as its last top-up date plus 28 days.
We reduce the output from the previous step to just one row to get the start and end dates for each free call interval per customer.

Step 1: Calculating Top-Up Intervals with LAG/LEAD

First, let’s filter out top-ups with less than €20 and enrich the dataset to include columns with the days since the previous top-up and days to the next top-up.

CREATE TABLE topups_enriched AS
      SELECT
            user_id
           ,date
           ,topup_value
           ,date - LAG(date,1) OVER (PARTITION BY user_id ORDER BY date)
                  AS days_after_prev_20topup
           ,LEAD(date,1) OVER (PARTITION BY user_id ORDER BY date) - date
                  AS days_before_next_20topup
      FROM topups
      WHERE topup_value >= 20

We can simply retrieve the date differences using the LAG function, which looks at the previous row and LEAD function, which looks at the following row. Since we are ordering by date, previous and following rows correspond to the previous and following top-ups.

Solve my Redshift Problems

Result:

user_id	date	topup_ value	days_after_prev _20topup	days_before_next_20topup
⋮
4	2017-06-17	25	18
4	2017-05-30	25	76	18
4	2017-03-15	25	28	76
4	2017-02-15	20	27	28
4	2017-01-19	25	19	27
4	2016-12-31	20	11	19
4	2016-12-20	20		11
5	2017-06-15	20	15
5	2017-05-31	25	32	15
5	2017-04-29	20	93	32
5	2017-01-26	20	26	93
5	2016-12-31	25	45	26
5	2016-11-16	20	0	45
5	2016-11-16	25		0

Let’s have a look at two examples. The first is a free call interval of the customer with id 4, coloured in blue. This interval starts with his first top-up date (2016-12-20) and ends with the date 2017-03-15 plus 28 days, which is 2017-04-12. We know that if the previous top-up was more than 28 days ago, the current date is the starting date of a new interval. Similarly, if the next top-up is more than 28 days ahead, then we know that the current top-up is the last of the free call interval.
The second occurrence is for the customer with id 5 and is coloured in green. He topped-up €20 or more twice on the same day, but the next top-up is 45 days away, so we expect one free call interval from 2016-11-16 to 2016-12-14 (which is 2016-11-16 plus 28 days).

Step 2: Mark Starts and End Dates of Free Call Periods

Now let’s enrich the dataset even more. We will create 0/1 flags determining whether the current top-up marks the start or the end of an interval. Moreover, we will introduce a flag for the edge top-up, which is simply either start or end of any interval. We will call inner top-ups those that are not edge top-ups and delete them in the next step.

CREATE TABLE extra_flags AS
      SELECT topups_enriched.*
           ,CASE WHEN COALESCE(days_after_prev_20topup,30) > 28
                  OR COALESCE(days_before_next_20topup,30) > 28 THEN 1
                 ELSE 0
            END AS edge_topup_flg
           ,CASE WHEN COALESCE(days_after_prev_20topup,30) > 28 THEN 1
                 ELSE 0
            END AS starting_topup_flg
           ,CASE WHEN COALESCE(days_before_next_20topup,30) > 28 THEN 1
                 ELSE 0
            END AS ending_topup_flg
      FROM topups_enriched;

The CASE statement is a natural way how to create flags. Note the COALESCE function that returns the first argument and only if the first argument is null, then returns the second one. We are using COALESCE function to return flag “1” in the case of null values. The reason is that the null value in the days_after_prev_20topup column means that it is the very first top-up, which is for sure the start of a period. Analogously for the days_before_next_20topup column.
Result:

user_id	date	topup_ value	days_after_prev _20topup	days_before_next _20topup	edge_topup _flg	starting_topup _flg	ending_topup _flg
⋮
4	2017-06-17	25	18		1	0	1
4	2017-05-30	25	76	18	1	1	0
4	2017-03-15	25	28	76	1	0	1
4	2017-02-15	20	27	28	0	0	0
4	2017-01-19	25	19	27	0	0	0
4	2016-12-31	20	11	19	0	0	0
4	2016-12-20	20		11	1	1	0
5	2017-06-15	20	15		1	0	1
5	2017-05-31	25	32	15	1	1	0
5	2017-04-29	20	93	32	1	1	1
5	2017-01-26	20	26	93	1	0	1
5	2016-12-31	25	45	26	1	1	0
5	2016-11-16	20	0	45	1	0	1
5	2016-11-16	25		0	1	1	0

For the customer with id 4, the starting_topup_flag flag is true for the first date, and the ending flag is true for the top-up on 2017-03-15. Notice that the top-ups between them have both starting and ending flag false. Since we seek edge dates of intervals, the inner top-ups carry no useful information for us and they need to be filtered out.

Step 3: Filtering inner top-ups and computing ends

In the next step, we filter out the inner top-ups, as we discussed earlier, and compute the end dates of the intervals.

CREATE TABLE full_ranges AS
      SELECT
           user_id
          ,topup_value
          ,date AS interval_start
          ,CASE ending_topup_flg WHEN 1 THEN date + 28
                 ELSE 28 + LEAD(date) OVER (PARTITION BY user_id ORDER BY date)
           END AS interval_end
          ,starting_topup_flg
          ,ending_topup_flg
      FROM extra_flags
      WHERE edge_topup_flg = 1;

The end dates are computed as the next date in the sequence plus 28 days. Here, it is crucial to understand that the sequence already lacks the inner top-ups (see WHERE clause), and therefore, the output contains only first and last top-up for each free call interval. Now the interesting computation happens at rows with the first top-up of the interval: the LEAD function adds 28 days to the date from the following row, which is the last top-up of the interval. Indeed, this calculates correctly the end date of the interval.
An exception applies for the intervals consisting of only one top-up, indicated by both positive starting and positive ending flag, which ends exactly 28 days after their start.
Result:

user_id	topup_value	interval_start	interval_end	starting_topup_flg	ending_topup_flg
⋮
4	25	2017-06-17	2017-07-15	0	1
4	25	2017-05-30	2017-07-15	1	0
4	25	2017-03-15	2017-04-12	0	1
4	20	2016-12-20	2017-04-12	1	0
5	20	2017-06-15	2017-07-13	0	1
5	25	2017-05-31	2017-07-13	1	0
5	20	2017-04-29	2017-05-27	1	1
5	20	2017-01-26	2017-02-23	0	1
5	25	2016-12-31	2017-02-23	1	0
5	20	2016-11-16	2016-12-14	0	1
5	25	2016-11-16	2016-12-14	1	0

Now we can see that both the green and blue areas have only two rows, as only edge top-ups remained. Moreover, the first blue row with the positive starting flag already contains the end of the interval (2017-04-12), which agrees with our expectation.

Step 4: Final start/end date extraction

The query which computes the final output is simple: we pick the rows with the positive starting_topup_flg flag. The ends of the periods are already present in the interval_end field, and therefore, we can ignore rows with the negative starting flag.

SELECT
      user_id
     ,interval_start
     ,interval_end
FROM full_ranges
WHERE starting_topup_flg = 1
ORDER BY user_id, interval_start DESC;

Result:

user_id	interval_start	interval_end
1	2017-03-20	2017-05-15
1	2017-01-03	2017-02-25
1	2016-11-07	2017-01-02
2	2017-01-15	2017-02-12
3	2017-05-11	2017-06-08
3	2017-03-15	2017-05-03
3	2017-02-01	2017-03-01
3	2016-12-19	2017-01-16
4	2017-05-30	2017-07-15
4	2016-12-20	2017-04-12
5	2017-05-31	2017-07-13
5	2017-04-29	2017-05-27
5	2016-12-31	2017-02-23
5	2016-11-16	2016-12-14

The blue and green areas were reduced into one row, representing the free call interval. Notice that the dates match with our expectation.

All in one single query

To achieve the final result, we have created three SQL tables so that the process was more understandable. Naturally, we can reach the same result with a single, consolidated query:

WITH topups_enrich AS (
      SELECT
            user_id
           ,topup_value
           ,date
            -- check if the next top-up is more than 28 days away -> start of the period
           ,COALESCE(date - LAG(date,1) OVER (PARTITION BY user_id ORDER BY date),30) > 28
                  AS starting_topup_flg
            -- check if the previous top-up is more than 28 days away -> end of the period
           ,COALESCE(LEAD(date,1) OVER (PARTITION BY user_id ORDER BY date) - date,30)>28
                  AS ending_topup_flg
      FROM topups
      WHERE topup_value>=20
), full_ranges AS(
      SELECT
            te.*
            -- add information about end of the period as (date in the next row + 28)
           ,CASE WHEN ending_topup_flg THEN date
                 ELSE LEAD(date) OVER (PARTITION BY user_id ORDER BY date)
            END + 28 AS interval_end
      FROM topups_enrich AS te
      -- filter out non-edge top-ups (neither starting nor ending)
      WHERE starting_topup_flg OR ending_topup_flg
)
-- range information is complete -> keep only starting rows
SELECT user_id, date AS interval_start, interval_end
FROM full_ranges
WHERE starting_topup_flg
ORDER BY user_id, interval_start DESC;

Notice that the table extra_flags was only for illustrative purposes and we can get by with the conditions that appeared in the flag definitions. Is it possible to reduce the query even more? Unfortunately not. So what is the difference that makes the extra_flags table needless? Here, we can present a rule of thumb which says that the number of cascading queries is dependent only on the cases where an output of a window function is to be used by another window function. Every cascading dependency requires a new subquery. The SQL language does not allow us to nest the window functions or use an output of one in another in the same SELECT query.
In the query above, the first subquery produces the flags using window functions, which are then used by the second subquery to filter the rows and apply next window function on the result. Clearly, we are not able to combine the logic into a single query. The reason for the final SELECT query is that the WHERE clause logically operates before the window functions and we need a final filtering. Note that no window function was used to create the extra_flags table, and therefore, we can carry its logic to other subqueries.

About the author:

Uli Bethke

Co-founder of Sonra

Any questions or comments for Uli? Connect with him on LinkedIn.

Follow Uli Bethke:

Back to Blog

Cookie	Duration	Description
__cfruid	session	Cloudflare sets this cookie to identify trusted web traffic.
cookielawinfo-checkbox-marketing	1 month	This cookie is set by the GDPR Cookie Consent plugin to store the user consent for the cookies in the category "Marketing".
cookielawinfo-checkbox-necessary	1 month	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-preferences	1 month	This cookie is set by the GDPR Cookie Consent plugin to check if the user has given consent to use cookies under the "Preferences" category.
cookielawinfo-checkbox-statistics	1 month	This cookie is set by the GDPR Cookie Consent plugin to store the user consent for the cookies in the category "Statistics".
cookielawinfo-checkbox-unclassified	1 month	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Unclassified".
CookieLawInfoConsent	1 month	Records the default button state of the corresponding category & the status of CCPA. It works only in coordination with the primary cookie.
csrftoken	1 year	This cookie is associated with Django web development platform for python. Used to help protect the website against Cross-Site Request Forgery attacks

Cookie	Duration	Description
AnalyticsSyncHistory	1 month	Linkedin set this cookie to store information about the time a sync took place with the lms_analytics cookie.
bcookie	2 years	LinkedIn sets this cookie from LinkedIn share buttons and ad tags to recognize browser ID.
bscookie	2 years	LinkedIn sets this cookie to store performed actions on the website.
lang	session	LinkedIn sets this cookie to remember a user's language setting.
li_gc	2 years	Linkedin set this cookie for storing visitor's consent regarding using cookies for non-essential purposes.
lidc	1 day	LinkedIn sets the lidc cookie to facilitate data center selection.
mgref	1 year	This cookie is set by Eventbrite to deliver content tailored to the end user's interests and improve content creation. It is also used for event-booking purposes.
mgrefby	1 year	This cookie is set by Eventbrite to deliver content tailored to the end user's interests and improve content creation. It is also used for event-booking purposes.
UserMatchHistory	1 month	LinkedIn sets this cookie for LinkedIn Ads ID syncing.

Cookie	Duration	Description
G	1 year	Cookie used to facilitate the translation into the preferred language of the visitor.
SERVERID	session	This cookie is set by Slideshare's HAProxy load balancer to assign the visitor to a specific server.

Cookie	Duration	Description
_ga	2 years	The _ga cookie, installed by Google Analytics, calculates visitor, session and campaign data and also keeps track of site usage for the site's analytics report. The cookie stores information anonymously and assigns a randomly generated number to recognize unique visitors.
_ga_7H38LVR4Z5	2 years	This cookie is installed by Google Analytics.
_gat_gtag_UA_44804396_1	1 minute	Set by Google to distinguish users.
_gat_UA-44804396-1	1 minute	A variation of the _gat cookie set by Google Analytics and Google Tag Manager to allow website owners to track visitor behaviour and measure site performance. The pattern element in the name contains the unique identity number of the account or website it relates to.
_gcl_au	3 months	Provided by Google Tag Manager to experiment advertisement efficiency of websites using their services.
_gid	1 day	Installed by Google Analytics, _gid cookie stores information on how visitors use a website, while also creating an analytics report of the website's performance. Some of the data that are collected include the number of visitors, their source, and the pages they visit anonymously.
CONSENT	2 years	YouTube sets this cookie via embedded youtube-videos and registers anonymous statistical data.
SIDCC	6 Months	The "SIDCC" cookie is used as security measure to protect users data from unauthorised access
test_cookie	15 minutes	The test_cookie is set by doubleclick.net and is used to determine if the user's browser supports cookies.
VISITOR_INFO1_LIVE	5 months 27 days	A cookie set by YouTube to measure bandwidth that determines whether the user gets the new or old player interface.
YSC	session	YSC cookie is set by Youtube and is used to track the views of embedded videos on Youtube pages.
yt-remote-connected-devices	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt-remote-device-id	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt.innertube::nextId	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.
yt.innertube::requests	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.

Cookie	Duration	Description
AN	1 month
AS	session
ebEventToTrack	1 month
eblang	1 year
SNID	2 years	This cookie is set by the Google. This cookie is used by the map which helps visitors to identify and reach the facility.
SP	session
SS	session

SQL Visualisation Guide - Query Diagrams, Lineage & ERD

SQL Visualisation Guide - Query Diagrams, Lineage & ERD

SQL Visualisation Guide - Query Diagrams, Lineage & ERD

SQL Visualisation Guide - Query Diagrams, Lineage & ERD

Redshift Window Functions Advanced Use Cases

Merging time-based events into periods

FlowForward.

Step 1: Calculating Top-Up Intervals with LAG/LEAD

Step 2: Mark Starts and End Dates of Free Call Periods

Step 3: Filtering inner top-ups and computing ends

Step 4: Final start/end date extraction

All in one single query

Uli Bethke

Redshift Window Functions Advanced Use Cases

Merging time-based events into periods

FlowForward.

Step 1: Calculating Top-Up Intervals with LAG/LEAD

Step 2: Mark Starts and End Dates of Free Call Periods

Step 3: Filtering inner top-ups and computing ends

Step 4: Final start/end date extraction

All in one single query

Related Articles

Cookies consent