Comparing Window Function Features by Database Vendors
We will round off the series on window functions with comparison of what database vendors offer. There are various mutations of window functions and every vendor supports a different subset or feature. Some also add extra window functions or features beyond standard ANSI SQL. One of the most powerful features is user-defined aggregate functions (UDAF), which some databases allow using as window functions, giving us new possibilities and greater power.
Let’s first introduce the features. Let’s then have a look at a matrix comparing feature with vendor offerings and some comments. Next section contains list of vendors together with links to the documentation on the window functions, and the last section inform you about some extra features of individual vendors.
Window Function Features
1. AGG&SEQ – Aggregate and sequencing functions
We examined what kind of functions can be used in the OVER clause as window functions. This group includes the main aggregation functions that you can use in the GROUP BY clause, plus sequencing and ranking functions, which produce output according to a position of the row in the table.
- MAX
- MIN
- SUM
- AVG
- COUNT
- ROW_NUMBER
- RANK
- DENSE_RANK
- FIRST
- LAST
- NTH_VALUE
- LEAD
- LAG
2. ANALYTIC – Analytical/Statistical functions
All functions that give some valuable information resulting from the distribution of the data.
- PERCENTILE_CONT
- PERCENTILE_DISC
- NTILE
- CUME_DIST
- PERCENT_RANK
- MEDIAN
- REGR_*
Note that the first two functions require the WITHIN GROUP clause instead of the OVER clause. We also include statistical functions calculating linear regression starting with REGR_ in this category.
Notice in the matrix below, that the MEDIAN function is often not supported. Standard aggregate functions, such as MAX or AVG, can be easily computed in one iteration over the values, whereas for the MEDIAN function, the engine needs to sort the values first (even though there are efficient algorithms for doing this). Therefore, the function is often not supported at all, rather than available but slow.
3. RANGE – Support for flexible frame definition
The ORDER clause of the window function can contain the keywords ROWS or RANGE. The numerical expression of rows is almost always supported in the ROWS clause. However, the numerical expression of range in the RANGE clause is often not supported because of the complexity of implementing it.
We consider this feature to be satisfied if the database allows numerical values inside the RANGE clause. Some of the databases allow using RANGE clause only with UNBOUNDED or CURRENT ROW. You will see a note in the matrix if this is the case. You can find more on this topic in the previous post Window function frames on Redshift and BigQuery.
4. DISTINCT – Distinct inside window function
Removes duplicate values before applying the window function. The syntax is the following:
1 |
window_function (DISTINCT field_name) OVER (PARTITION BY … ORDER BY ...) |
Particularly useful in the context of COUNT(…) aggregate function. As it is expensive to calculate distinct values some engines, e.g. Oracle now support approximate distinct.
5. LISTAGG
LISTAGG (in Oracle) or STRING_AGG (in PostgreSQL) is a function that aggregates values into a string of characters, where the values are delimited by a specified separator. For instance, the query:
1 |
SELECT last_name, STRING_AGG(last_name, ', ') OVER (PARTITION BY dept_no) FROM employees; |
for each employee produces a comma separated list of all employees in his department.
We also include the function XMLAGG in this category, which aggregates values into XML.
6. ARRAYS – Aggregation into arrays
Aggregating values into an array using the ARRAY_AGG function. This is particularly useful for nesting and unnesting parent child relationships of very large tables.
7. CLAUSE – Clause for the WINDOW declaration
This feature enables users to define the window in a stand-alone clause. The statement prevents repetitions of the same window function definitions.
Syntax:
1 |
WINDOW name_of_the_window AS (PARTITIONED BY … ORDER BY ...) |
8. UDAF – User-defined aggregate functions
UDAFs allow users to create a custom aggregate function and, most importantly, allow users to use them as window functions. Most supported languages are Java, C, C++, C#. DB2 also allows COBOL, PostgreSQL allows Perl and Redshift requires Python. An example of a str_agg UDAF function in PostgreSQL using plain SQL:
1 2 3 4 5 6 7 |
CREATE OR REPLACE FUNCTION concat_ws_comma(text, ANYELEMENT) RETURNS text AS $ SELECT concat_ws(',', $1, $2) $ LANGUAGE sql; CREATE AGGREGATE str_agg (ANYELEMENT) ( sfunc = concat_ws_comma, stype = text); |
Matrix of features supported by vendors
Features / |
AGG&SEQ |
ANALYTIC |
RANGE |
DISTINCT |
LISTAGG |
ARRAYS |
CLAUSE |
UDAF |
Oracle |
||||||||
Teradata |
||||||||
MS SQL |
||||||||
DB2 |
||||||||
PostgreSQL |
||||||||
Hive/Spark |
||||||||
Apache Drill |
||||||||
Presto |
||||||||
Cloudera |
||||||||
Google |
||||||||
Amazon |
||||||||
Snowflake |
Vendors
For the comparison we have picked the most popular databases and also some trending vendors. The vendors listed below contain links leading straight to the window function section of their documentation.
- Oracle
- Teradata
- MSSQL
- DB2
- PostgreSQL
- Apache Hive / Spark
- Apache Drill
- Presto
- Cloudera Impala
- Google BigQuery
- Redshift
- Snowflake
Extra features
Oracle: LOOKUPS
The functions First and Last can be applied logically before the window function itself and allow us to print another field than we are aggregating over. They are used always together with the keyword KEEP and can be used also without window functions.
Syntax:
1 |
window_function() KEEP (DENSE_RANK FIRST/LAST ORDER BY expr) OVER (PARTITION BY … ORDER BY ...) |
Oracle: MATCH_RECOGNIZE
Very useful feature for the analytics is the pattern matching in Oracle. With a lot of customizations, the MATCH_RECOGNIZE clause enables looking for a trend or pattern in the data, such as spikes, quick drop or periods. For instance, when we daily watch a price of a product, we might want to see all occurrences of the following series: rise, then five days without change and then drop. Let’s look at the code implementing this analysis:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
SELECT * FROM price_history MATCH_RECOGNIZE ( PARTITION BY product ORDER BY tstamp MEASURES STRT.tstamp AS start_tstamp, LAST(UP.tstamp) AS peak_tstamp, LAST(DOWN.tstamp) AS end_tstamp, MATCH_NUMBER() AS mno ONE ROW PER MATCH AFTER MATCH SKIP TO LAST DOWN PATTERN (STRT UP+ FLAT{5} DOWN+) DEFINE UP AS UP.price > PREV(UP.price), FLAT AS FLAT.price = PREV(FLAT.price), DOWN AS DOWN.price < PREV(DOWN.price) ) MR ORDER BY MR.product, MR.start_tstamp; |
The MEASURES clause defines what will be produced for each occurrence in the output, DEFINE clause allows us to define pattern variables and the PATTERN defines the trend we want to find. For thorough information on pattern matching, see Oracle documentation.
Teradata: RESET WHEN
The clause RESET WHEN can be placed after ORDER BY in the window function declaration and is always followed by a condition. If the condition is evaluated to true at some row, a new dynamic partition is created in addition to the partitions defined in the PARTITION BY clause. For example, we can count the days, when the price is rising and reset the counting whenever the price decreases or does not move:
1 2 3 4 5 6 7 |
SELECT price, ROW_NUMBER() OVER (ORDER BY tstamp RESET WHEN price <= -- previous row SUM(balance) over (ORDER BY tstamp ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) ) - 1 as count FROM prices; |
And a possible output could look similar to this:
price |
count |
60 |
0 |
70 |
1 |
65 |
0 |
71 |
1 |
80 |
2 |
87 |
3 |
85 |
0 |
Find more on the RESET WHEN clause in the Teradata documentation.
Teradata: QUALIFY
In the previous posts on the window functions, we often needed to filter out some rows after the window functions had been applied. Since they operate after WHERE and HAVING clause, we had to use another SELECT as a subquery to filter them out.
The QUALIFY clause was made to overcome this limitation and works exactly the same way as HAVING clause for the traditional aggregations. See the example from the official documentation:
1 2 3 |
SELECT StoreID, SUM(profit) OVER (PARTITION BY StoreID) FROM facts QUALIFY SUM(profit) OVER (PARTITION BY StoreID) > 2; |
PostgreSQL: JSON_AGG
Function JSON_AGG, which aggregates values similarly as XMLAGG but into the JSON format.
Conclusion
Even though window functions are a clear and coherent concept, we can see a rich diversity in features implemented by the vendors. Usually, when a vendor have not implemented some feature, it is because there is not so large group of users, who would benefit. Moreover, full implementation of RANGE or MEDIAN requires excessive work on efficient algorithms. Also, often parallelization of such calculations pose serious challenges in the implementation (cf. count (distinct)). A lot of the vendors are, however, promising these features, hence should be delivered in the near future.
One can notice from the discussion forums that the features are discussed at the moment and the development is very alive. This is especially true for the newer databases, such as PrestoDB, Spark or Snowflake. Let’s hope the vendors find resources to provide us with the full set of the window functions available, as they are defined by the SQL language. Until then, let yourself be inspired from the matrix above when choosing the right vendor.