Window Functions (aka Analytic Functions) in Spark.

Uli Bethke analytic functions, Big Data, MapR, Spark, SparkSQL, SQL

As of Spark 1.4.0 we now have support for window functions (aka analytic functions) in SparkSQL. At Sonra we are heavy users of SparkSQL to handle data transformations for structured data. We also use it in combination with cached RDDs and Tableau for business intelligence and visual analytics.

Teach me Big Data to Advance my Career

Spark SQL and Window Functions: The rationale

I am a strong supporter of using SQL for performing ETL and data transformations of structured data. There are several reasons for this:

  • A lot of people know SQL. Similar to English in the real world it is the lingua franca of data.
  • Most requirements can be met with SQL for structured data. Admittedly, it is not so good when it comes to querying graphs, complex trees, non-structured data, and complex many to many relationships.
  • In a lot of cases you have to write less code using the SparkSQL abstraction layer.

Window functions allow you to perform inter-row calculations that in traditional SQL require ugly and costly self-joins, e.g. a moving average can be calculated in one line of code. If you want to learn more about window functions in SQL I would recommend the SQL Cookbook from O'Reilly.

An example

Merging the patch into older Spark versions

We are using version 1.3.1 of Spark on top of the MapR Hadoop distribution. As Spark 1.3.1 does not ship with the Window functions we merged the 1.4.0 patch into the 1.3.0 MapR Spark version. You can contact us for the assembly if you want to achieve the same (also for Spark 1.2.1),

About the author

Uli Bethke LinkedIn Profile

Uli has 18 years’ hands on experience as a consultant, architect, and manager in the data industry. He frequently speaks at conferences. Uli has architected and delivered data warehouses in Europe, North America, and South East Asia. He is a traveler between the worlds of traditional data warehousing and big data technologies.

Uli is a regular contributor to blogs and books and chairs the the Hadoop User Group Ireland. He is also a co-founder and VP of the Irish chapter of DAMA, a non for profit global data management organization. He has co-founded the Irish Oracle Big Data User Group.