SQL parser for Databricks. Parsing SQL for table audit logging and much more

September 29, 2023

This is the fifth article in our series on parsing different SQL dialects. We explored SQL parsing on Snowflake, MS SQL Server, Oracle, and Redshift. in our earlier blog posts. We cover SQL parsing on DataBricks in this blog post. We take table and column audit logging as a use case for parsing SQL on Databricks.

We provide practical examples of interpreting SQL from the DataBricks query history. Additionally, we will present some code that utilises the FlowHigh SQL parser SDK to programmatically parse SQL from DataBricks. The parsing of SQL Server SQL can be automated using the SDK.

In another post on the Sonra blog, I go into great depth on the benefits of using an SQL parser. In this post we cover the use cases of an SQL parser for both data engineering and data governance use cases.

One example for a use case of an SQL parser is table and column audit logging. Audit logging refers to the detailed recording of access and operations performed on specific tables and columns in a database including execution of SQL queries. Such logging can be essential for ensuring security, compliance with regulatory standards, and understanding data access patterns.

SQL parser for Databricks

Sonra has created a powerful online SQL parser designed for any SQL dialect, including DataBricks. It is called FlowHigh. This SaaS platform includes an easy-to-use UI for manual SQL parsing as well as an SDK for managing bulk SQL parsing requirements or automating the operation. We demonstrate FlowHigh’s capabilities by parsing the query history of DataBricks. To programmatically parse the query history, we used the SDK.

Programmatically parsing the DataBricks query history with the FlowHigh SDK

Query history API

Databricks records every SQL query executed and retains this information in the query execution log for a period of 30 days. To retrieve this data, users can utilise a dedicated API endpoint: /api/2.0/sql/history/queries. This endpoint includes details such as the timestamp of execution and the user, among other metrics. You can extract the content of each query via the query_text attribute and also get the query_id as a unique identifier.

Unity Catalog and system tables

Databricks’ API-centric method for accessing the SQL query history is quite different from traditional relational database management systems such as Oracle, MS SQL Server, or other cloud data platforms such as Snowflake.

Databricks recently introduced a new feature called Unity Catalog for storing metadata. It stores metadata but is not very comprehensive in comparison to other platforms such as Snowflake

The Databricks Unity Catalog provides a centralised system to manage and monitor data assets. It offers a consolidated location for all data, ensuring it’s managed under set guidelines. Moreover, it diligently records every action taken on the data within the associated Databricks account, ensuring transparency and accountability.

Databricks currently supports a variety of system tables, each serving a distinct purpose:

  • Audit logs: This table captures a record of events and is found at the location system.access.audit.
  • Billable usage logs: Used to track billable activities, this table can be accessed at system.billing.usage.
  • Pricing table: For insights into pricing details, users can refer to the table at system.billing.list_prices.
  • Table and column lineage: Both these tables, which trace the source and flow of data, are housed under the directory system.access.
  • Marketplace listing access: This table, which is situated at system.marketplace.listing_access_events, tracks access events related to marketplace

System tables are controlled by Unity Catalog, we must have at least one workspace in our account that is enabled by Unity Catalog in order to enable and access system tables. System tables contain information from every workspace in the account, but only workspaces with the Unity Catalog feature can access them.

Programmatically parsing SQL from query history API

The below python code shows how query history is pulled from Databricks and processed using FlowHigh SDK:

Analysing the output of the FlowHigh SQL parser

An SQL query is ingested by the FlowHigh SQL parser for Databricks, which then returns the processed result either as a JSON or XML message. For instance, the parser produces a full JSON message of the SQL query from the query history we collected using the API. This output includes information on the filter conditions, fields fetched, aliases used, join conditions, tables and other components of the SQL statement.

Let’s go through an example

For illustration, the following is an example SELECT statement output in JSON:

Tables and columns

In the JSON representation, datasets are denoted as T1, T2, and T3, which are likely references or aliases for the tables used in the SQL query. Columns in the statement are labelled as C1, C2, C3, and C4. If a particular column has an alias, it is provided within the JSON. In cases where there’s no alias, the column is simply represented by its label, such as C1, C2, etc.

Joins

The analysis of the JSON representation reveals two inner join conditions. The initial join condition establishes a connection based on the attributes C1 and C5. Meanwhile, the subsequent join condition associates the attributes C6 and C7.

GROUP BY

The JSON shows that the data is grouped using two columns: C1 and C2.

FILTER

From the JSON data, it’s evident that the query uses certain filter conditions. Specifically, it selects entries where the column C1 matches a value of 456. It also filters the data to rows where the column C8 has dates ranging from ‘1994-01-01’ to ‘1995-01-01’. These criteria ensure that only pertinent records fitting these parameters are chosen.

ORDER BY

From the JSON data, we see that the query sorts its results using the ORDER BY clause in ascending order on column C2.

FlowHigh User Interface for SQL parsing

You can also access the FlowHigh SQL parser through the web based user interface. The below figure shows how FlowHigh provides the information about tables in a SQL query by grouping them into types of tables.

When we select a table name, it reveals the associated column names. For instance, by selecting the PART table, we can view its corresponding column names.

Likewise FlowHigh can be used to get columns used in a where conditions ,order by m group by and joins in the SQL query.

Columns used in GROUP BY / ORDER BY clause

This figure shows how FlowHigh can be used to filter out the columns used in order by and group by clause.

Filter columns

This figure shows how FlowHigh can be used to filter out the columns used in where clauses.

Columns in join conditions

This figure shows how FlowHigh can be used to filter out the columns used and type of join.

Need more?

FlowHigh ships with two other modules

  • FlowHigh SQL Analyser. The Analyser checks for issues in your SQL code. It detects 30+ anti patterns in SQL code

I have written up a blog post on automatically detecting bad SQL code.

  • FlowHigh SQL Visualiser. Visualising SQL queries helps understand complex queries. It lets developers see how each part of the query contributes to the final result, making it easier to understand and debug.

Do you want to dive hands-on into SQL parsing? Request access to the FlowHigh to parse Databricks SQL.