Create your own custom aggregate (UDAF) and window functions in Snowflake

Published on February 4, 2018
Updated on July 2, 2024

In this post we show you how to create your own aggregate functions in the Snowflake cloud data warehouse. This type of feature is known as a user-defined aggregate function (UDAF). Most big data frameworks, such as Spark, Hive and Impala, let you create your own UDAFs, and traditional databases such as Oracle and SQL Server have this feature too. However, with these data stores we typically need to write Java code and compile it, which makes deployment awkward and time-consuming.
With Snowflake it’s much easier. You can write these functions in plain JavaScript. There is no need to compile them and no need for DBAs to deploy them, which makes these functions much more accessible to developers.

Why custom aggregate functions (UDAFs)?

Few developers know about user-defined aggregate functions and even fewer use them. I think they are more widely used on big data frameworks such as Spark and Hive.
You might be wondering why you would need to create your own aggregate function. Here are some examples:

Randomly select one of the values from a GROUP BY
Get the greatest common divisor (GCD)
Get the lowest common denominator (LCD)
Calculate Median (not supported by some databases)
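To make two of these examples concrete, here is a minimal sketch in plain JavaScript of what a GCD aggregate and a median aggregate have to compute for one group of values. The function names gcdOfGroup and medianOfGroup are hypothetical and chosen only for illustration; this is ordinary JavaScript that runs outside Snowflake, not a Snowflake function.

```javascript
// Euclid's algorithm for two non-negative integers.
function gcd(a, b) {
  while (b !== 0) {
    [a, b] = [b, a % b];
  }
  return a;
}

// Fold a whole group of values into a single result, as an aggregate does.
// Starting from 0 works because gcd(0, x) === x.
function gcdOfGroup(values) {
  return values.reduce((acc, value) => gcd(acc, value), 0);
}

// Median of a non-empty group: sort a copy, then take the middle value
// (or the mean of the two middle values for an even-sized group).
function medianOfGroup(values) {
  const sorted = [...values].sort((x, y) => x - y);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

console.log(gcdOfGroup([15, 20, 25]));    // 5
console.log(medianOfGroup([25, 15, 20])); // 20
```

Both functions need to see every value in the group before they can return, which is exactly what distinguishes an aggregate from a scalar function.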

Similar to functions in programming languages, user defined aggregate functions in the database world accept parameters (usually multiple rows) as input, perform some aggregate operations such as finding the maximum or average value of the input and then return the result of that operation.
In this post we will be working with our dataset of telecommunications top-ups.

UDAFs in Snowflake

Snowflake provides a number of predefined aggregate functions such as MIN, MAX, AVG and SUM for performing operations on a set of rows. A few database systems, such as Oracle and SQL Server, also allow you to define custom aggregate functions, but the process is usually very tedious. Oracle, for example, requires you to write Java code that then has to be compiled and deployed by the DBAs. SQL Server requires custom CLR code written in C#.
Snowflake supports JavaScript table UDFs, which output an extra column containing the result of applying a UDF to your collection of rows. They are very simple to create using JavaScript and even easier to deploy.
Let’s first create a table UDF (UDAF) that outputs the product of all values in the input:

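create or replace function AGG_MULTIPLY(ins double)
   returns table (num double)
   language javascript
   as '{
   // Called for every input row: fold the value into the running product.
   processRow: function (row, rowWriter) {
     this.cMult = this.cMult * row.INS;
   },
   // Called after the last row: emit the final product as a single row.
   finalize: function (rowWriter, i) {
    rowWriter.writeRow({NUM: this.cMult});
   },
   // Called before the first row: start from the multiplicative identity.
   initialize: function() {
    this.cMult = 1;
   }}';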

You can see that to implement a table UDF we have to define a JavaScript object with three functions:

initialize – this function is called before any rows are processed. Here we set up the state that is stored on this object and can be updated by the other functions
processRow – this function is called once for each row in the input and receives that row's values in the row argument. We can also choose to output a row using the rowWriter
finalize – this function is called after the last row in the input is processed. Here we can do the final calculations and also output them as a new row
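The order in which the three functions are called can be sketched with a small local simulation. The runPartition driver and its mock rowWriter below are hypothetical stand-ins written for this illustration; they are not part of Snowflake and only mimic the call sequence described above, using the same handler body as AGG_MULTIPLY.

```javascript
// Handler object with the same three callbacks as AGG_MULTIPLY.
const aggMultiply = {
  initialize: function () {
    this.cMult = 1;
  },
  processRow: function (row, rowWriter) {
    this.cMult = this.cMult * row.INS;
  },
  finalize: function (rowWriter) {
    rowWriter.writeRow({ NUM: this.cMult });
  },
};

// Hypothetical local driver (an assumption, not a Snowflake API):
// initialize once, processRow per input row, finalize once at the end.
function runPartition(handler, rows) {
  const output = [];
  const rowWriter = { writeRow: (outRow) => output.push(outRow) };
  const state = Object.create(handler); // fresh state for this partition
  state.initialize();
  for (const row of rows) {
    state.processRow(row, rowWriter);
  }
  state.finalize(rowWriter);
  return output;
}

const result = runPartition(aggMultiply, [{ INS: 15 }, { INS: 20 }, { INS: 25 }]);
console.log(result); // [ { NUM: 7500 } ]
```

Because only finalize writes a row, three input rows produce a single output row, which is what makes this table UDF behave like an aggregate.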

In the example we named the input “ins”. Inside the JavaScript code we can only refer to it in capital letters (row.INS). The output of the function is a table with a single column named “num” of type double, which we likewise write as NUM inside JavaScript.
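The effect of getting the case wrong is easy to reproduce in plain JavaScript. The snippet below is a sketch that assumes the row object is keyed by the upper-cased argument name, as described above; the row literal and the readIns helper are hypothetical and exist only for this illustration.

```javascript
// Mock input row, keyed by the upper-cased argument name (assumption).
const row = { INS: 15 };

// Correct: the upper-case key holds the value.
console.log(1 * row.INS); // 15

// Wrong case: the property does not exist, so the product silently becomes NaN.
console.log(1 * row.ins); // NaN

// Hypothetical defensive helper: fail loudly instead of propagating NaN.
function readIns(inputRow) {
  if (!("INS" in inputRow)) {
    throw new Error("Expected an upper-case INS property on the input row");
  }
  return inputRow.INS;
}

console.log(readIns(row)); // 15
```

JavaScript does not raise an error for the misspelled property, so a lower-case reference would typically show up as an unexpected result rather than a failed statement.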
We can use this UDAF to multiply the value of all top-ups a customer has made:

SELECT
    customer_id,
    res.num AS total_multiplied
FROM
    top_ups,
    TABLE(AGG_MULTIPLY(topup_value)
        OVER (PARTITION BY customer_id)) AS res;

Unfortunately, Snowflake does not support the use of table UDFs (UDAFs) with the GROUP BY clause. We had to use the TABLE function with our table UDF and the OVER clause for defining the input rows.
The result of our query is the following:

UDAFs with window function in Snowflake

While table UDFs cannot be used natively as window functions, there are some workarounds that we will show you.
We will first show you a simple modification that makes a Snowflake UDAF behave like a window function with a frame of RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. These are also called running aggregates.
We will create a function RUN_MULTIPLY which multiplies the values of all top-ups in the input rows (we will use it to multiply all of the top-up values on the same date).
For every row in the input it outputs the current (running) value of the product:

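create or replace function "RUN_MULTIPLY"(ins double)
    returns table (num double)
    language javascript
    as '{
    // Called for every input row: update the running product and emit it.
    processRow: function (row, rowWriter) {
      this.cMult = this.cMult * row.INS;
      rowWriter.writeRow({NUM: this.cMult});
    },
    // Nothing left to emit once the last row has been processed.
    finalize: function (rowWriter, i) {
    },
    // Called before the first row: start from the multiplicative identity.
    initialize: function() {
     this.cMult = 1;
    }}';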

Compared to the AGG_MULTIPLY function, notice that here the finalize function does not output a row, while the processRow function does.
Using the TABLE(...) OVER (PARTITION BY ... ORDER BY ...) syntax we can apply this JavaScript table UDF to every partition that is created based on the date column.

SELECT
    *
FROM
    top_ups ,
    TABLE(RUN_MULTIPLY(topup_value)
        OVER (
        PARTITION BY date
        ORDER BY topup_value ASC)) as num
ORDER BY
    date ASC, topup_value ASC ;

The output of this query contains an extra column holding the value returned by our JavaScript table UDF RUN_MULTIPLY, computed separately within each date (our partition key):

On 2017-03-15 we had three top-ups, by the customers with CUSTOMER_ID 2, 3 and 4, for amounts of 15, 20 and 25 bitcoins. The running product for the top-ups on this day is 15, 300 (15*20) and 7500 (15*20*25).
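The running product for that day can be reproduced with a small local simulation. The runOverPartitions driver below is a hypothetical stand-in for PARTITION BY date ORDER BY topup_value ASC, written only for this illustration and not part of Snowflake. The three rows use the values quoted above and assume that the amounts 15, 20 and 25 belong to customers 2, 3 and 4 respectively.

```javascript
// Handler object with the same three callbacks as RUN_MULTIPLY.
const runMultiply = {
  initialize: function () {
    this.cMult = 1;
  },
  processRow: function (row, rowWriter) {
    this.cMult = this.cMult * row.INS;
    rowWriter.writeRow({ NUM: this.cMult });
  },
  finalize: function (rowWriter) {},
};

// Hypothetical local driver (an assumption, not a Snowflake API):
// group rows by date, sort each group by topup_value, then feed the
// handler one partition at a time with fresh state.
function runOverPartitions(handler, rows) {
  const partitions = new Map();
  for (const r of rows) {
    if (!partitions.has(r.date)) partitions.set(r.date, []);
    partitions.get(r.date).push(r);
  }
  const output = [];
  for (const [date, part] of partitions) {
    part.sort((a, b) => a.topup_value - b.topup_value);
    const state = Object.create(handler);
    state.initialize();
    for (const r of part) {
      // Rows written from processRow are paired with the current input row.
      const rowWriter = { writeRow: (out) => output.push({ ...r, ...out }) };
      state.processRow({ INS: r.topup_value }, rowWriter);
    }
    state.finalize({ writeRow: (out) => output.push({ date, ...out }) });
  }
  return output;
}

// Deliberately unsorted, to show that the ORDER BY step matters.
const topUps = [
  { customer_id: 4, date: "2017-03-15", topup_value: 25 },
  { customer_id: 2, date: "2017-03-15", topup_value: 15 },
  { customer_id: 3, date: "2017-03-15", topup_value: 20 },
];

const running = runOverPartitions(runMultiply, topUps);
console.log(running.map((r) => r.NUM)); // [ 15, 300, 7500 ]
```

Each input row yields one output row because processRow writes on every call, and the state is reset for every partition, so a second date would start again from 1.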

When developing the JavaScript table UDFs we experienced some complications with data type conversions, which will hopefully be improved. Specifically, the JavaScript table UDFs did not accept integers as inputs, because JavaScript has no separate integer data type: all of its numbers are double-precision floating-point values. To work around this problem we had to change the fields with an integer data type to type float.
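The underlying JavaScript behaviour can be checked in a few lines. This sketch only demonstrates how JavaScript itself represents numbers; it makes no claim about how Snowflake converts its own data types, and the isExactAsDouble helper is a hypothetical name used only for this illustration.

```javascript
// Every JavaScript number literal, with or without a decimal point,
// has the same type: a double-precision float.
console.log(typeof 25);   // "number"
console.log(typeof 25.5); // "number"

// Hypothetical helper: true when an integer value is small enough
// to be represented exactly as a double.
function isExactAsDouble(value) {
  return Number.isSafeInteger(value);
}

console.log(isExactAsDouble(25)); // true

// 2^53 + 1 cannot be represented exactly and is rounded to 2^53.
const big = 9007199254740993;
console.log(big === 9007199254740992); // true
console.log(isExactAsDouble(big));     // false
```

Small whole numbers such as top-up amounts survive the conversion to float unchanged; only very large integers, or very large products of them, lose precision.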

Tell us what custom aggregate or window functions you have created. Leave a comment in the section below.

Enjoyed this post? Have a look at the other posts on our blog.
Contact us for Snowflake professional services.
We created the content in partnership with Snowflake.
