Comparing Window Function Features by Database Vendors

Jiří Mauritz Data Warehouse, Redshift, SQL for Analysis, Window Functions

We will round off the series on window functions with comparison of what database vendors offer. There are various mutations of window functions and every vendor supports a different subset or feature. Some also add extra window functions or features beyond standard ANSI SQL. One of the most powerful features is user-defined aggregate functions (UDAF), which some databases allow using as window functions, giving us new possibilities and greater power.

Let's first introduce the features.  Let's then have a look at a matrix comparing feature with vendor offerings and some comments. Next section contains list of vendors together with links to the documentation on the window functions, and the last section inform you about some extra features of individual vendors.

Window Function Features

1. AGG&SEQ - Aggregate and sequencing functions

We examined what kind of functions can be used in the OVER clause as window functions. This group includes the main aggregation functions that you can use in the GROUP BY clause, plus sequencing and ranking functions, which produce output according to a position of the row in the table.

Aggregate:

  • MAX
  • MIN
  • SUM
  • AVG
  • COUNT

Sequencing and ranking:

  • ROW_NUMBER
  • RANK
  • DENSE_RANK
  • FIRST
  • LAST
  • NTH_VALUE
  • LEAD
  • LAG

2. ANALYTIC - Analytical/Statistical functions

All functions that give some valuable information resulting from the distribution of the data.

  • PERCENTILE_CONT
  • PERCENTILE_DISC
  • NTILE
  • CUME_DIST
  • PERCENT_RANK
  • MEDIAN
  • REGR_*

Note that the first two functions require the WITHIN GROUP clause instead of the OVER clause. We also include statistical functions calculating linear regression starting with REGR_ in this category.

Notice in the matrix below, that the MEDIAN function is often not supported. Standard aggregate functions, such as MAX or AVG, can be easily computed in one iteration over the values, whereas for the MEDIAN function, the engine needs to sort the values first (even though there are efficient algorithms for doing this). Therefore, the function is often not supported at all, rather than available but slow.

3. RANGE - Support for flexible frame definition

The ORDER clause of the window function can contain the keywords ROWS or RANGE. The numerical expression of rows is almost always supported in the ROWS clause. However, the numerical expression of range in the RANGE clause is often not supported because of the complexity of implementing it.

We consider this feature to be satisfied if the database allows numerical values inside the RANGE clause. Some of the databases allow using RANGE clause only with UNBOUNDED or CURRENT ROW. You will see a note in the matrix if this is the case. You can find more on this topic in the previous post Window function frames on Redshift and BigQuery.

4. DISTINCT - Distinct inside window function

Removes duplicate values before applying the window function. The syntax is the following:

Particularly useful in the context of COUNT(…) aggregate function. As it is expensive to calculate distinct values some engines, e.g. Oracle now support approximate distinct.

5. LISTAGG

LISTAGG (in Oracle) or STRING_AGG (in PostgreSQL) is a function that aggregates values into a string of characters, where the values are delimited by a specified separator. For instance, the query:

for each employee produces a comma separated list of all employees in his department.

We also include the function XMLAGG in this category, which aggregates values into XML.

6. ARRAYS - Aggregation into arrays

Aggregating values into an array using the ARRAY_AGG function. This is particularly useful for nesting and unnesting parent child relationships of very large tables.

7. CLAUSE - Clause for the WINDOW declaration

This feature enables users to define the window in a stand-alone clause. The statement prevents repetitions of the same window function definitions.

Syntax:

8. UDAF - User-defined aggregate functions

UDAFs allow users to create a custom aggregate function and, most importantly, allow users to use them as window functions. Most supported languages are Java, C, C++, C#. DB2 also allows COBOL, PostgreSQL allows Perl and Redshift requires Python. An example of a str_agg UDAF function in PostgreSQL using plain SQL:


Submit your e-mail address to download the matrix

A link to PDF was sent to your e-mail
Please specify a valid email

Matrix of features supported by vendors

Features /
Databases

AGG&SEQ

ANALYTIC

RANGE

DISTINCT

LISTAGG

ARRAYS

CLAUSE

UDAF

Oracle

Teradata

MS SQL

DB2

PostgreSQL

Hive/Spark

Apache Drill

Presto

Cloudera
Impala

Google
BigQuery

Amazon
Redshift

Snowflake

Vendors

For the comparison we have picked the most popular databases and also some trending vendors. The vendors listed below contain links leading straight to the window function section of their documentation.

Extra features

Oracle: LOOKUPS

The functions First and Last can be applied logically before the window function itself and allow us to print another field than we are aggregating over. They are used always together with the keyword KEEP and can be used also without window functions.

Syntax:

Oracle: MATCH_RECOGNIZE

Very useful feature for the analytics is the pattern matching in Oracle. With a lot of customizations, the MATCH_RECOGNIZE clause enables looking for a trend or pattern in the data, such as spikes, quick drop or periods. For instance, when we daily watch a price of a product, we might want to see all occurrences of the following series: rise, then five days without change and then drop. Let’s look at the code implementing this analysis:

The MEASURES clause defines what will be produced for each occurrence in the output, DEFINE clause allows us to define pattern variables and the PATTERN defines the trend we want to find. For thorough information on pattern matching, see Oracle documentation.

Teradata: RESET WHEN

The clause RESET WHEN can be placed after ORDER BY in the window function declaration and is always followed by a condition. If the condition is evaluated to true at some row, a new dynamic partition is created in addition to the partitions defined in the PARTITION BY clause. For example, we can count the days, when the price is rising and reset the counting whenever the price decreases or does not move:

And a possible output could look similar to this:

price

count

60

0

70

1

65

0

71

1

80

2

87

3

85

0

Find more on the RESET WHEN clause in the Teradata documentation.

Teradata: QUALIFY

In the previous posts on the window functions, we often needed to filter out some rows after the window functions had been applied. Since they operate after WHERE and HAVING clause, we had to use another SELECT as a subquery to filter them out.

The QUALIFY clause was made to overcome this limitation and works exactly the same way as HAVING clause for the traditional aggregations. See the example from the official documentation:

PostgreSQL: JSON_AGG

Function JSON_AGG, which aggregates values similarly as XMLAGG but into the JSON format.

Conclusion

Even though window functions are a clear and coherent concept, we can see a rich diversity in features implemented by the vendors. Usually, when a vendor have not implemented some feature, it is because there is not so large group of users, who would benefit. Moreover, full implementation of RANGE or MEDIAN requires excessive work on efficient algorithms. Also, often parallelization of such calculations pose serious challenges in the implementation (cf. count (distinct)). A lot of the vendors are, however, promising these features, hence should be delivered in the near future.

One can notice from the discussion forums that the features are discussed at the moment and the development is very alive. This is especially true for the newer databases, such as PrestoDB, Spark or Snowflake. Let’s hope the vendors find resources to provide us with the full set of the window functions available, as they are defined by the SQL language. Until then, let yourself be inspired from the matrix above when choosing the right vendor.