SQL parser for Databricks. Parsing SQL for table audit logging and much more

by Uli Bethke

Uli has been rocking the data world since 2001. As the Co-founder of Sonra, the data liberation company, he’s on a mission to set data free. Uli doesn’t just talk the talk—he writes the books, leads the communities, and takes the stage as a conference speaker.

Any questions or comments for Uli? Connect with him on LinkedIn.

Published on September 29, 2023
Updated on January 27, 2025

This is the fifth article in our series on parsing different SQL dialects. In earlier blog posts we explored SQL parsing on Snowflake, MS SQL Server, Oracle, and Redshift. This post covers SQL parsing on Databricks, using table and column audit logging as the use case.

We provide practical examples of interpreting SQL from the Databricks query history. We also present code that uses the FlowHigh SQL parser SDK to programmatically parse SQL from Databricks. With the SDK, the parsing of Databricks SQL can be automated.

In another post on the Sonra blog, I go into great depth on the benefits of using an SQL parser. That post covers the use cases of an SQL parser for both data engineering and data governance.

One use case for an SQL parser is table and column audit logging. Audit logging is the detailed recording of access to, and operations performed on, specific tables and columns in a database, including the execution of SQL queries. Such logging can be essential for ensuring security, complying with regulatory standards, and understanding data access patterns.

SQL parser for Databricks

Sonra has created a powerful online SQL parser designed for any SQL dialect, including Databricks. It is called FlowHigh. This SaaS platform includes an easy-to-use UI for manual SQL parsing as well as an SDK for handling bulk SQL parsing requirements or automating the process. We demonstrate FlowHigh’s capabilities by parsing the Databricks query history. To parse the query history programmatically, we used the SDK.

Programmatically parsing the Databricks query history with the FlowHigh SDK

Query history API

Databricks records every SQL query executed and retains this information in the query execution log for a period of 30 days. To retrieve this data, users can utilise a dedicated API endpoint: /api/2.0/sql/history/queries. This endpoint includes details such as the timestamp of execution and the user, among other metrics. You can extract the content of each query via the query_text attribute and also get the query_id as a unique identifier.
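As a rough illustration of the two attributes mentioned above, the sketch below pulls query_id and query_text out of one page of the API response. The payload is hand-written sample data, not real output. The "res" key is the same one used by the code later in this post, and the helper name extract_queries is hypothetical.

```python
from typing import Any, Dict, List, Tuple


def extract_queries(page: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (query_id, query_text) pairs from one page of query history."""
    pairs = []
    for item in page.get("res", []):
        text = item.get("query_text")
        # Skip entries that carry no SQL text: there is nothing to parse.
        if text:
            pairs.append((item["query_id"], text))
    return pairs


# Hand-written sample payload (assumption: shaped like one response page).
sample_page = {
    "res": [
        {"query_id": "q-001", "query_text": "SELECT 1"},
        {"query_id": "q-002", "query_text": ""},
        {"query_id": "q-003", "query_text": "SELECT * FROM part"},
    ],
    "has_next_page": False,
}

print(extract_queries(sample_page))
```

Keeping the query_id next to the text makes it possible to tie the parser output back to the user and timestamp recorded in the query history.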

Unity Catalog and system tables

Databricks’ API-centric method for accessing the SQL query history is quite different from traditional relational database management systems such as Oracle, MS SQL Server, or other cloud data platforms such as Snowflake.

Databricks recently introduced a new feature called Unity Catalog for storing metadata. The metadata it stores is not as comprehensive as that of other platforms such as Snowflake.

The Databricks Unity Catalog provides a centralised system to manage and monitor data assets. It offers a consolidated location for all data, ensuring it’s managed under set guidelines. Moreover, it diligently records every action taken on the data within the associated Databricks account, ensuring transparency and accountability.

Databricks currently supports a variety of system tables, each serving a distinct purpose:

Audit logs: This table captures a record of events and is found at system.access.audit.
Billable usage logs: Used to track billable activities, this table can be accessed at system.billing.usage.
Pricing table: For insights into pricing details, users can refer to the table at system.billing.list_prices.
Table and column lineage: These two tables, which trace the source and flow of data, are housed under the schema system.access.
Marketplace listing access: This table, located at system.marketplace.listing_access_events, tracks access events related to marketplace listings.

System tables are controlled by Unity Catalog. To enable and access them, at least one workspace in the account must be enabled for Unity Catalog. System tables contain information from every workspace in the account, but only workspaces with Unity Catalog enabled can access them.
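For quick reference, the locations listed above can be kept in a small lookup. This is only a sketch: the helper name preview_query and the dictionary keys are made up for illustration, and the two lineage table names (system.access.table_lineage and system.access.column_lineage) are taken from the Databricks documentation rather than from the list above.

```python
# Lookup of system table locations. Keys are hypothetical labels.
SYSTEM_TABLES = {
    "audit_logs": "system.access.audit",
    "billable_usage": "system.billing.usage",
    "pricing": "system.billing.list_prices",
    # Assumption: lineage table names as given in the Databricks docs.
    "table_lineage": "system.access.table_lineage",
    "column_lineage": "system.access.column_lineage",
    "marketplace_listing_access": "system.marketplace.listing_access_events",
}


def preview_query(key: str, limit: int = 10) -> str:
    """Build a simple preview statement for one of the system tables."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return f"SELECT * FROM {SYSTEM_TABLES[key]} LIMIT {limit}"


print(preview_query("audit_logs", 5))
```

The generated statement can be run from any workspace that has Unity Catalog enabled.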

Programmatically parsing SQL from query history API

The Python code below shows how the query history is pulled from Databricks and processed using the FlowHigh SDK:

import sqlite3

import pandas as pd
import requests
from flowhigh.utils.converter import FlowHighSubmissionClass


def get_databricks_query_history(base_url, token, max_results=100, include_metrics=False):
    """Page through the query history API and return all entries as a list."""
    endpoint = f"{base_url}/api/2.0/sql/history/queries"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    results = []
    page_token = None
    has_next_page = True

    while has_next_page:
        params = {
            "max_results": max_results,
            "include_metrics": include_metrics,
        }
        if page_token:
            params["page_token"] = page_token

        response = requests.get(endpoint, headers=headers, params=params)
        response_data = response.json()
        if response.status_code != 200:
            print(
                f"Error {response.status_code}: {response_data.get('error_code', '')} - {response_data.get('message', '')}")
            # Return whatever was collected before the error.
            return results

        # "res" holds the queries of the current page; it may be missing if the page is empty.
        results.extend(response_data.get("res", []))
        page_token = response_data.get("next_page_token")
        has_next_page = response_data.get("has_next_page", False)

    return results


if __name__ == "__main__":
    BASE_URL = "<BASE_URL>"
    TOKEN = "<TOKEN>"

    # Assumption: the original listing does not show how the target table and
    # connection are created. A local SQLite file and a hypothetical table
    # name stand in here; replace them with your own database connection.
    fh_table_name = "fh_parsed_queries"
    conn = sqlite3.connect("flowhigh.db")

    data = []
    query_history = get_databricks_query_history(BASE_URL, TOKEN)
    for query in query_history:
        # Parse each statement with FlowHigh and keep the XML message.
        fh = FlowHighSubmissionClass.from_sql(query['query_text'])
        xml_msg = fh.xml_message
        entry = {'id': query['query_id'], 'query': query['query_text'], 'xml': xml_msg}
        data.append(entry)

    data_df = pd.DataFrame(data)
    data_df.to_sql(fh_table_name, conn, if_exists='append', index=False)

Analysing the output of the FlowHigh SQL parser

An SQL query is ingested by the FlowHigh SQL parser for Databricks, which then returns the processed result either as a JSON or XML message. For instance, the parser produces a full JSON message of the SQL query from the query history we collected using the API. This output includes information on the filter conditions, fields fetched, aliases used, join conditions, tables and other components of the SQL statement.

Let’s go through an example.

SELECT p_partkey
      ,p_name
      ,SUM(l_extendedprice*(1-l_discount)) AS revenue
  FROM part
  JOIN lineitem
    ON p_partkey=l_partkey
  JOIN orders
    ON l_orderkey=o_orderkey
 WHERE p_partkey=456
   AND o_orderdate BETWEEN '1994-01-01' AND '1995-01-01'
 GROUP BY p_partkey
         ,p_name
 ORDER BY p_name;

For illustration, the following is an example SELECT statement output in JSON:

{
  "0": {
    "pos": "0-320",
    "ds": [
      {
        "pos": "0-320",
        "type": "root",
        "subType": "inline",
        "out": {
          "exprs": [
            {"pos": "7-11", "oref": "C1", "eltype": "attr"},
            {"pos": "24-8", "oref": "C2", "eltype": "attr"},
            {
              "type": "fagg",
              "pos": "39-46",
              "alias": "revenue",
              "name": "SUM",
              "exprs": [
                {
                  "pos": "43-30",
                  "exprs": [
                    {"pos": "43-15", "oref": "C3", "eltype": "attr"},
                    {
                      "pos": "60-12",
                      "exprs": [
                        {"value": 1, "eltype": "const"},
                        {"pos": "62-10", "oref": "C4", "eltype": "attr"}
                      ],
                      "type": "PLUS",
                      "eltype": "op"
                    }
                  ],
                  "type": "MULTI",
                  "eltype": "op"
                }
              ],
              "eltype": "func"
            }
          ],
          "eltype": "out"
        },
        "in": {
          "exprs": [
            {"pos": "91-6", "alias": "p", "oref": "T1", "eltype": "ds"},
            {
              "type": "inner",
              "definedAs": "explicit",
              "ds": {"pos": "103-10", "alias": "l", "oref": "T2", "eltype": "ds"},
              "op": {
                "pos": "119-23",
                "exprs": [
                  {"pos": "119-11", "oref": "C1", "eltype": "attr"},
                  {"pos": "131-11", "oref": "C5", "eltype": "attr"}
                ],
                "type": "EQ",
                "eltype": "op"
              },
              "eltype": "join"
            },
            {
              "type": "inner",
              "definedAs": "explicit",
              "ds": {"pos": "148-8", "alias": "o", "oref": "T3", "eltype": "ds"},
              "op": {
                "pos": "162-25",
                "exprs": [
                  {"pos": "162-12", "oref": "C6", "eltype": "attr"},
                  {"pos": "175-12", "oref": "C7", "eltype": "attr"}
                ],
                "type": "EQ",
                "eltype": "op"
              },
              "eltype": "join"
            }
          ],
          "eltype": "in"
        },
        "modifiers": [
          {
            "type": "filtreg",
            "op": {
              "pos": "194-73",
              "exprs": [
                {
                  "pos": "194-15",
                  "exprs": [
                    {"pos": "194-11", "oref": "C1", "eltype": "attr"},
                    {"value": 456, "eltype": "const"}
                  ],
                  "type": "EQ",
                  "eltype": "op"
                },
                {
                  "pos": "215-52",
                  "exprs": [
                    {"pos": "215-13", "oref": "C8", "eltype": "attr"},
                    {"value": "'1994-01-01'", "eltype": "const"},
                    {"value": "'1995-01-01'", "eltype": "const"}
                  ],
                  "type": "BETWEEN",
                  "eltype": "op"
                }
              ],
              "type": "AND",
              "eltype": "op"
            },
            "eltype": "filter"
          },
          {
            "type": "aggreg",
            "exprs": [
              {"pos": "277-11", "oref": "C1", "eltype": "attr"},
              {"pos": "294-8", "oref": "C2", "eltype": "attr"}
            ],
            "eltype": "agg"
          },
          {
            "exprs": [
              {"pos": "312-8", "direction": "asc", "oref": "C2", "eltype": "attr"}
            ],
            "eltype": "sort"
          }
        ],
        "eltype": "ds"
      }
    ],
    "antiPatterns": [
      {"type": "AP_11", "pos": ["43-15", "62-10"], "eltype": "antiPattern"},
      {"type": "AP_08", "pos": ["39-46"], "eltype": "antiPattern"}
    ],
    "eltype": "statement"
  }
}

Tables and columns

In the JSON representation, tables are denoted as T1, T2, and T3. These labels appear in the oref attribute and refer to the three tables in the FROM clause. Columns are labelled in the same way, as C1 through C8. If a column or expression has an alias, such as revenue for the SUM expression, the alias is provided within the JSON. Where there is no alias, the column is represented only by its label, such as C1 or C2.
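One way to list the table and column labels is to walk the parsed message and collect every oref value. The sketch below does this on a small hand-written fragment shaped like the message above. The function name collect_orefs is hypothetical, and splitting labels by their T or C prefix is an assumption based on the labels seen in this example.

```python
from typing import Any, Optional, Set


def collect_orefs(node: Any, found: Optional[Set[str]] = None) -> Set[str]:
    """Recursively gather every "oref" value in a nested dict/list structure."""
    if found is None:
        found = set()
    if isinstance(node, dict):
        if "oref" in node:
            found.add(node["oref"])
        for value in node.values():
            collect_orefs(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_orefs(item, found)
    return found


# Hand-written fragment modelled on the parser output shown in this post.
fragment = {
    "out": {"exprs": [{"oref": "C1", "eltype": "attr"},
                      {"oref": "C2", "eltype": "attr"}]},
    "in": {"exprs": [
        {"alias": "p", "oref": "T1", "eltype": "ds"},
        {"type": "inner",
         "ds": {"alias": "l", "oref": "T2", "eltype": "ds"},
         "eltype": "join"},
    ]},
}

orefs = collect_orefs(fragment)
# Assumption: table labels start with "T" and column labels with "C".
tables = sorted(o for o in orefs if o.startswith("T"))
columns = sorted(o for o in orefs if o.startswith("C"))
print(tables, columns)
```

For audit logging, the resulting lists can be stored next to the query_id so that each query is linked to the tables and columns it touched.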

Joins

The JSON representation contains two inner joins. The first join condition compares the attributes C1 and C5 for equality. The second join condition compares the attributes C6 and C7.

{
  "type": "inner",
  "op": {
    "exprs": [
      {"oref": "C1", "eltype": "attr"},
      {"oref": "C5", "eltype": "attr"}
    ],
    "type": "EQ",
    "eltype": "op"
  }
}

{
  "type": "inner",
  "op": {
    "exprs": [
      {"oref": "C6", "eltype": "attr"},
      {"oref": "C7", "eltype": "attr"}
    ],
    "type": "EQ",
    "eltype": "op"
  }
}
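The join conditions can be pulled out of the message with a short recursive walk. The sketch below assumes that join nodes carry "eltype": "join", as in the full message earlier in this post, and that each join condition compares exactly two operands. The function name iter_joins and the sample data are illustrative only.

```python
from typing import Any, Iterator, Tuple


def iter_joins(node: Any) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (join_type, operator, left, right) for every join node found."""
    if isinstance(node, dict):
        if node.get("eltype") == "join":
            op = node["op"]
            # Assumption: a join condition with exactly two operands.
            left, right = op["exprs"]
            yield (node["type"], op["type"],
                   left.get("oref", "?"), right.get("oref", "?"))
        for value in node.values():
            yield from iter_joins(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_joins(item)


# Hand-written sample modelled on the "in" section of the parser output.
in_clause = {
    "exprs": [
        {"oref": "T1", "eltype": "ds"},
        {"type": "inner",
         "ds": {"oref": "T2", "eltype": "ds"},
         "op": {"exprs": [{"oref": "C1", "eltype": "attr"},
                          {"oref": "C5", "eltype": "attr"}],
                "type": "EQ", "eltype": "op"},
         "eltype": "join"},
        {"type": "inner",
         "ds": {"oref": "T3", "eltype": "ds"},
         "op": {"exprs": [{"oref": "C6", "eltype": "attr"},
                          {"oref": "C7", "eltype": "attr"}],
                "type": "EQ", "eltype": "op"},
         "eltype": "join"},
    ],
    "eltype": "in",
}

joins = list(iter_joins(in_clause))
print(joins)
```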

GROUP BY

The JSON shows that the data is grouped using two columns: C1 and C2.

{
  "type": "aggreg",
  "exprs": [
    {"oref": "C1", "eltype": "attr"},
    {"oref": "C2", "eltype": "attr"}
  ],
  "eltype": "agg"
}

FILTER

The JSON also shows the filter conditions of the query. It selects rows where column C1 equals 456, and it restricts the result to rows where column C8 holds a date between '1994-01-01' and '1995-01-01'. The two conditions are combined with AND.

{
  "type": "filtreg",
  "op": {
    "exprs": [
      {
        "pos": "175-13",
        "exprs": [
          {"oref": "C1"},
          {"value": 456}
        ],
        "type": "EQ"
      },
      {
        "pos": "194-50",
        "exprs": [
          {"oref": "C8"},
          {"value": "'1994-01-01'"},
          {"value": "'1995-01-01'"}
        ],
        "type": "BETWEEN"
      }
    ],
    "type": "AND"
  }
}
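Because the filter is a nested tree of operators, a small recursive function is enough to turn it back into a readable expression. The sketch below is one way to do that, using the operator names seen in this example (EQ, BETWEEN, AND). The function name render and the output format are my own choices, not part of FlowHigh.

```python
from typing import Any, Dict


def render(node: Dict[str, Any]) -> str:
    """Render an expression node as a readable string."""
    if "oref" in node:           # column reference, e.g. C1
        return node["oref"]
    if "value" in node:          # constant, e.g. 456
        return str(node["value"])
    # Otherwise treat the node as an operator with child expressions.
    operands = [render(child) for child in node["exprs"]]
    op = node["type"]
    if op == "BETWEEN" and len(operands) == 3:
        return f"({operands[0]} BETWEEN {operands[1]} AND {operands[2]})"
    return "(" + f" {op} ".join(operands) + ")"


# The "op" part of the filter shown above, without the "pos" attributes.
filter_op = {
    "exprs": [
        {"exprs": [{"oref": "C1"}, {"value": 456}], "type": "EQ"},
        {"exprs": [{"oref": "C8"},
                   {"value": "'1994-01-01'"},
                   {"value": "'1995-01-01'"}],
         "type": "BETWEEN"},
    ],
    "type": "AND",
}

rendered = render(filter_op)
print(rendered)
```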

ORDER BY

From the JSON data, we see that the query sorts its results using the ORDER BY clause in ascending order on column C2.

{
  "exprs": [
    {"pos": "285-6", "direction": "asc", "oref": "C2", "eltype": "attr"}
  ],
  "eltype": "sort"
}
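The grouping and sorting entries can be summarised in the same spirit. The sketch below assumes the entries are distinguished by "eltype" values of "agg" and "sort", as in the snippets in this post. The function name summarise_modifiers is hypothetical, and treating a missing direction as ascending is an assumption.

```python
from typing import Any, Dict, List


def summarise_modifiers(modifiers: List[Dict[str, Any]]) -> Dict[str, list]:
    """Collect GROUP BY columns and ORDER BY (column, direction) pairs."""
    summary: Dict[str, list] = {"group_by": [], "order_by": []}
    for mod in modifiers:
        if mod.get("eltype") == "agg":
            summary["group_by"] = [e["oref"] for e in mod["exprs"]]
        elif mod.get("eltype") == "sort":
            # Assumption: a missing "direction" means ascending order.
            summary["order_by"] = [(e["oref"], e.get("direction", "asc"))
                                   for e in mod["exprs"]]
    return summary


# Hand-written sample modelled on the GROUP BY and ORDER BY snippets.
sample_modifiers = [
    {"type": "aggreg",
     "exprs": [{"oref": "C1", "eltype": "attr"},
               {"oref": "C2", "eltype": "attr"}],
     "eltype": "agg"},
    {"exprs": [{"direction": "asc", "oref": "C2", "eltype": "attr"}],
     "eltype": "sort"},
]

summary = summarise_modifiers(sample_modifiers)
print(summary)
```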

FlowHigh User Interface for SQL parsing

You can also access the FlowHigh SQL parser through the web-based user interface. The figure below shows how FlowHigh presents the tables in an SQL query, grouped by table type.

When we select a table name, it reveals the associated column names. For instance, by selecting the PART table, we can view its corresponding column names.

Likewise, FlowHigh can be used to get the columns used in WHERE conditions, ORDER BY, GROUP BY, and joins in the SQL query.

Columns used in GROUP BY / ORDER BY clause

This figure shows how FlowHigh can be used to filter for the columns used in ORDER BY and GROUP BY clauses.

Filter columns

This figure shows how FlowHigh can be used to filter for the columns used in WHERE clauses.

Columns in join conditions

This figure shows how FlowHigh can be used to filter for the columns used in join conditions, together with the type of join.

Need more?

FlowHigh ships with two other modules:

FlowHigh SQL Analyser. The Analyser checks for issues in your SQL code. It detects 30+ anti-patterns in SQL code.

I have written up a blog post on automatically detecting bad SQL code.

FlowHigh SQL Visualiser. Visualising SQL queries helps understand complex queries. It lets developers see how each part of the query contributes to the final result, making it easier to understand and debug.

Do you want to dive hands-on into SQL parsing? Request access to FlowHigh to parse Databricks SQL.
