Window Functions (aka Analytic Functions) in Spark.

Uli Bethke analytic functions, Big Data, MapR, Spark, SparkSQL, SQL

As of Spark 1.4.0 we now have support for window functions (aka analytic functions) in SparkSQL. At Sonra we are heavy users of SparkSQL to handle data transformations for structured data. We also use it in combination with cached RDDs and Tableau for business intelligence and visual analytics.

Teach me Big Data to Advance my Career

Spark SQL and Window Functions: The rationale

I am a strong supporter of using SQL for performing ETL and data transformations of structured data. There are several reasons for this:

  • A lot of people know SQL. Similar to English in the real world it is the lingua franca of data.
  • Most requirements can be met with SQL for structured data. Admittedly, it is not so good when it comes to querying graphs, complex trees, non-structured data, and complex many to many relationships.
  • In a lot of cases you have to write less code using the SparkSQL abstraction layer.

Window functions allow you to perform inter-row calculations that in traditional SQL require ugly and costly self-joins, e.g. a moving average can be calculated in one line of code. If you want to learn more about window functions in SQL I would recommend the SQL Cookbook from O'Reilly.

An example

Merging the patch into older Spark versions

We are using version 1.3.1 of Spark on top of the MapR Hadoop distribution. As Spark 1.3.1 does not ship with the Window functions we merged the 1.4.0 patch into the 1.3.0 MapR Spark version. You can contact us for the assembly if you want to achieve the same (also for Spark 1.2.1),