Data Exploration with Window Functions on Redshift

Jiří Mauritz Data Warehouse, Redshift, Window Functions

We have already introduced the main concept, syntax and simple examples of window functions applied to practical problems. In this post, we will go through some more advanced window functions and aim our focus on analytical use cases.

The dataset we will work with consists of information about phone calls and internet usage of two users. For each call, we know its duration in seconds and if the call was incoming or outgoing. In the latter case, we also know the phone number of the counterpart. We also have the information about the amount of data downloaded (measured in kB). Let’s see a sample:

Loading data from S3 to Redshift

The dataset is available through the following link: calls.tsv. The following code shows how to create a schema for our data and how to load the dataset from S3 storage to Redshift:

MAX OVER() to identify Longest call

In the first exercise let’s print all calls together with the longest call that the user has ever made with the current counterpart. In addition, we would like to know the longest call in the current day to compare. Worth noting the important feature of the window functions: each OVER clause can use different aggregation independently. Let’s use aggregation over both the phone and date to demonstrate this feature. We will skip the incoming calls and internet usage.

Output:

Rows with the same phone number are in the same colour. Notice that they share the same number_longest time. The duration time in bold marks the calls that were longest on a particular day. See the comparison with the longest_today column.

The aggregate function MAX together with the partitioning by phone clause computes the longest call made by a user and puts this against all user's calls so we have this repeating extra column available for various comparisons. Using the traditional GROUP BY aggregation would require self-joining tables.

To extract a date from a timestamp, we use TRUNC function in the query. This expression is valid only for Redshift and equivalent in Postgres is DATE_TRUNC('day', date_time). Also note that to convert the number of seconds in INT to INTERVAL, we concatenate the number with ' secs'string and then cast to INTERVAL with a double colon.

SUM function: Total duration over phone numbers

Let's find out who receives the longest calls. In other words, we are interested in the total sum of call durations for the particular counterpart. Let’s also print the percentage of this sum relative to all user’s calls.

Let’s understand the query starting from the inner core section:

  • The total duration over user and phone is computed by the window function SUM and partitioned by the mentioned fields. The same is applied to INTERVAL and INT type.
  • To compute the percentage, we use the previously mentioned expression in the numerator and sum over all calls in the denominator. The type of duration is INT, which would cause the division to give integer result, and therefore we need to cast the denominator to FLOAT before the division. After the division, the result is multiplied by 100 and rounded to get the percentage.
  • The CASE clause checks if the total duration is greater than 0 so that we do not divide by zero. If the condition is false, there are only calls of zero length for the phone number, hence we return 0 as the total duration in this case.
  • The WHERE clause collects only rows with outgoing calls since we do not know the phone number of incoming calls.

Window functions produce an output row for each call, but similar to the previous example all rows with the same phone number contain the same aggregated information.

Output:

We can discover from the percentage column for example, that the first user spent almost half of his calling time by calling number 49508432274. Some other numbers took less than 1 percent of total call time.

Lead function: Periods of inactivity

Anytime we need to calculate values based on neighbouring rows, window functions should be the first thing that come to mind. A good example for our dataset would be periods of inactivity between calls. Let’s say we want to see all periods of inactivity longer than 2 days for each user. We will consider internet usage as an instant activity with no duration and calls as an activity with the specified duration. This scenario is also very useful for identification and segmentation of sessions in a clickstream data, also known as sessionization.

To get starts and ends of the periods, we will take advantage of the LEAD function, which behaves as the LAG function, however, instead of the previous rows, it looks up the following ones.

The beginning of a period is computed as a sum of the starting date, time and duration of the call. The COALESCE function ensures that for the internet usage, the duration is zero since the function returns 0 anytime duration field is null. Period of inactivity ends when the following activity begins, therefore LEAD function extracts time of the next activity in the sequence ordered by date and time partitioned by a user. The next two columns are just differences of the first two columns in INT and INTERVAL type.

The outer SELECT puts a threshold on the period length and filters only periods longer than 2 days. The following table shows the output of the query. Notice the last periods for each user - both of them did not use the phone for more than a week.

Ntile function: Producing Quartiles

If a dataset becomes large enough, we could be interested in some statistical measures. The window function NTILE(num_buckets) enables us to compute quantiles. It divides the partition as equally as possible by assigning numbers 1..num_buckets to rows. Let’s see an example for the following new dataset of six values: [12,3,65,78,2,9]

The second and third column show how NTILE function assigns numbers to all values so that each bucket has an equal number of rows. In the second column, clause NTILE(2) ensures that the values are divided into two buckets, represented by numbers 1 and 2. The same principle is applied in the third column but now with three buckets and numbers 1,2,3. The buckets are, however, unrelated to the values itself and the assignment is basically random. The last column specifies ORDER BY clause, which forces the function to number ordered sequence of values. Therefore, lower numbers are assigned to the lower values.

When the ORDER BY clause is specified, we can interpret the output as division into quantiles taking data as the split criteria. In the case of internet usage, the task is to find quartiles of the data usage distribution. Let’s see the result of the NTILE function together with ordering by data.

Code:

Output:

Neither the date nor time is important now, but we can see that the rows are ordered by data_usage, and therefore, the quartiles can be computed as follows:

Using aggregate functions MIN and MAX we compute the range of the quartile intervals.

Besides quartiles, we can derive the median from the output as (1303 + 1321) / 2 = 1312. Worth noting the alternative, simpler way to compute median by using percentile_cont(fraction) function together with clause WITHIN GROUP:

This code gives us median 1301.5, so which one is correct? The percentile_cont computes the correct median, but what went wrong with the first approach? The answer lies behind the algorithm of NTILE function. Let’s see the result of the NTILE function with 2 buckets instead of 4:

Now the median truly follows (1300 + 1303) / 2 = 1301.5. For a better understanding, we will print counts of the rows assigned to quartiles by NTILE function.

We can see that whenever it is not possible to make buckets of equal size, the NTILE function assigns more rows to the top buckets. That is why the first two groups contain 934 rows, whereas the other two groups contain only 932. Then the median is incorrectly computed from values shifted by one row. Of course, when the number of buckets is 2, both the groups contain 933 rows.

Redshift has an even simpler syntax for computing median since MEDIAN is one of the window functions.

Note that we limit the output to one line only because window functions always return a result for all rows, which would create unnecessary duplicates. Oracle supports the MEDIAN function as well but for example, PostgreSQL does not.