Benefits for Data Governance and Data Engineering
You are probably surprised to learn that SQL parsing underpins many data governance and data engineering use cases ranging from data lineage to performance tuning, data protection, productivity and many more.
An SQL Parser deconstructs an SQL query into its individual components. Let me walk you through a simple example.
SUM(ord.order_amount) as order_amt
JOIN customers cust ON (ord.customer_id = cust.customer_id)
Running this query through our SQL Parser FlowHigh produces the following visualisation
Looking at the parsed query tree that FlowHigh generated we can see the various elements of the SQL query
- The input columns: customer_id, order_amount
- The output columns: customer_id, order_amt
- The input tables: orders, customers
- The join columns: customer_id (customers) and customer_id (orders)
- The aggregation column: customer_id
- The sort column: order_amount
Having this information available we can meet a variety of different business requirements and use cases.
We will soon release FlowHigh, a commercial online SQL Parser. The first dialect we support will be Snowflake but we will add all of the important SQL dialects over time.
Register your interest for FlowHigh the SQL Parser from Sonra and we will let you know once it becomes available.
Let’s get a bit more technical and look at how an SQL Parser works under the hood. You can skip this section if your main interest is in the use cases and business benefits of using an SQL Parser.
SQL parsing is similar to parsing any other programming language. The parser deconstructs the SQL query based on the SQL syntax, identifies each of the query components, and then creates an Abstract Syntax Tree (AST) in a hierarchical format.
A simple example of an Abstract Syntax Tree for an SQL query
Getting from SQL query to AST is a three step process
The first step is lexical analysis (also known as tokenization). The lexer attaches meaning to a set of characters and strings in our SQL query. These strings with identified meaning are referred to as tokens.
An example of SQL tokens
|Token name||Sample token values|
|separator||( , ) ;|
|operator||< = != <>|
|comment||/* some comment about code */|
Lexical analysis is relatively simple. It is typically followed by a parsing stage. Parsing is also known as syntactic analysis.
The parsing step takes the tokens as input and builds an Abstract Syntax Tree. It checks that the tokens form an allowable expression by recursively iterating and looping over them.
You can categorise use cases for SQL parsing into data governance and data engineering use cases. Let’s start with some SQL Parser use cases for data governance
A SQL Parser can help you populate a data catalog. Companies use data catalogs to make it easier for data analysts and data scientists to find and discover useful corporate data sets. A data catalog typically contains a data dictionary with the descriptions, keywords and tags for tables and columns in a data warehouse, data management system, or database.
However, the data dictionary contains only very static information. Organisations can go a step further and enrich the data catalog with real usage stats of how users actually consume corporate data. By using an SQL Parser we can deconstruct the queries that users issue against the database and identify the columns and tables that were queried, how tables are joined and filtered and much more.
Additional information such as the name of the user who submitted the query, the time when the query was run, what tool submitted the query etc can be extracted from the SQL queries that were executed. The history of SQL queries is typically stored in the query history in the information schema of a database. The SQL Parser can extract all of the useful information from those queries over time.
Here is a list of some common usage metrics that you can capture in a data catalog for data set discoverability:
- Access counts for columns and tables
- Distinct access counts for columns and tables
- The most common queries that are run against a particular table. How is a table used in practice?
- Commonly joined tables
- Table usage by users or groups of users and teams
- Other query patterns
Having those metrics available you can answer questions such as:
- find popular datasets used widely across the company, e.g. what are the TOP 10 tables used at the company
- find datasets that are relevant to the work my team is doing
- find datasets that I might not be using, but I should know about
- Find related datasets
- Identify tables and columns that are rarely or never queried. They may be dropped.
You can read about the data catalog at Spotify to find out more about how they are using metadata from SQL queries to enrich their data catalog.
With an SQL Parser you can extract business rules and business transformation logic.
Here is a simple example:
SELECT ADD_MONTHS(date_column, 2), date_column AS RESULT FROM TABLEX;
We apply business transformation logic to the date_column by using the function ADD_MONTHS to add two months to the column date_column.
Using our SQL Parser, FlowHigh we can identify any columns where a function such as ADD_MONTHS has been applied. Based on this we can auto generate documentation for business transformation logic.
A very common application for an SQL Parser is to detect the flow of data across and inside systems and databases of an enterprise.
Here is a very simple example of how data from the ORDERS table flows to the ORDER_AGG table.
Lineage can be a visual representation of data origins (creation, transformation, import) and will help answer questions such as Where did the data come from?
There are two types of data lineage.
Table level lineage
Table level lineage tells us about dependencies at table level.
Table ORDER_AGG is a product of Table CUSTOMER and ORDER
Column level lineage
Having data lineage available at column level is a lot more useful as it allows you to trace the data at a more granular level, e.g. if you know that column CUSTOMER_NAME in table CUSTOMER contains PII and this column is used to populate column CUSTOMER_NAME in table ORDER_AGG then you know automatically that column CUSTOMER_NAME also contains PII. As a result you can use this information to identify any tables that contain PII data or even use it to automatically mask PII data across all tables.
The importance of data lineage
Having the ability to trace data flows through systems is very useful for multiple use cases
- With data lineage and metadata you can automate the creation of column level access controls.
- Data lineage allows you to identify all of the sensitive data within your data warehouse and operational systems.
- You can perform impact analysis, e.g. which tables are affected downstream if the name of a table or column is changed.
- Regulations such as BCBS 239 require the finance industry to trace back data to the authoritative source.
- With data lineage you can identify records that need to be deleted throughout your systems e.g. as part of a GDPR requirement.
- Data lineage can be used to troubleshoot the root cause of issues, e.g. the failure of an ETL process (stepwise debugging).
- Data lineage provides a complete audit trail for a data point of interest from the source to its final destinations.
A SQL Parser can be used to address auditing requirements for compliance. Regulations such as Sarbanes Oxley or HIPAA have common auditing-related requirements.
With audit logging you can monitor and record selected user database actions such as access to tables and columns by extracting the table and column names from SQL queries that users run.
- Audit logging will help with detecting suspicious activity on a database.
- Enable accountability for actions. These include actions taken in a particular schema, table, or row, or affecting specific content.
- Notify an auditor of actions by an unauthorized user. For example, an unauthorized user could change or delete data,
- Detects problems with an authorization or access control implementation.
Using an SQL Parser you can deconstruct queries submitted through websites to detect SQL injection attacks and prevent them. This research paper discusses the details of using an SQL Parser to detect SQL injection.
Apart from data governance apps SQL parsing is also useful for some general data engineering use cases.
You can use an SQL Parser to increase your productivity when working with SQL.
SQL is a very powerful language. It can be used to solve almost any type of data related problem or question. However, SQL can get quite verbose. Many beginners and intermediate users of the language tend to write unnecessarily complex SQL. An avalanche of subqueries, nesting and spaghetti code is the result.
SQL generators such as BI tools tend to generate poor SQL. This happens if the underlying model or semantic layer has been poorly designed. Some BI tools are just plain bad when it comes to generating good SQL.
Here are some visualisations from our SQL visualisation tool FlowQL
Debugging SQL becomes a breeze with an SQL visualisation tool.
500 lines of code at a glance in a diagram. A picture says more than a thousand words has never been more fitting.
Here is a summary of the use cases for SQL query visualisation.
- Quickly make sense of the query tree and query hierarchy
- Identify issues and query anti-patterns such as self-joins
- Get a list of all the physical tables and columns involved in a query
- Identify sub-queries, virtual tables, cross joins, and recursive queries at a quick glance
- Document important queries, e.g. document your ELT / ETL pipeline
SQL flow visualisation can also be done in reverse, i.e. creating a query through a visual tool and then generating the SQL code from it.
A well known productivity use case for an SQL Parser is SQL formatting. Using an SQL Formatter you can beautify SQL code, e.g. apply proper indentation, apply styles, and format the SQL code according to good practices.
FROM monthly_sales PIVOT(SUM(amount) FOR MONTH IN ('JAN', 'FEB', 'MAR', 'APR')) AS p (EMP_ID_renamed, JAN, FEB, MAR, APR) ORDER BY EMP_ID_renamed;
FROM monthly_sales PIVOT(SUM(amount)
FOR MONTH IN ('JAN', 'FEB', 'MAR', 'APR')) AS p (EMP_ID_renamed, JAN, FEB, MAR, APR)
ORDER BY EMP_ID_renamed;
Sometimes you need to make a change to a table. Using an SQL Parser you can determine the impact to upstream and downstream processes.
By analysing the SQL that is run against the changed table you will be able to determine what code needs to be refactored. You can make sure that nothing breaks if you go live with the change to your table.
Another useful application for an SQL Parser is to detect performance issues or recommend performance improvements.
Using an SQL Parser you can detect SQL anti-patterns. Anti-patterns are SQL queries that do not follow best practices.
I have listed some common anti-patterns below. Sometimes the anti-pattern is legitimate under certain circumstances. The list should be treated as a rule of thumb only.
- Using a self join instead of window functions
- Using the same table multiple times in a complex query may indicate that the query is not well written
- Using DISTINCT is often an indicator that something is wrong with the query or the underlying data quality
- Using UNION instead of UNION ALL
- Selecting all of the columns from a table SELECT * FROM
- Detecting unnecessary joins in a query, e.g. where the table is used in a join but then not used as part of the query.
While you can detect certain anti-patterns by just looking at a piece of SQL code you frequently require additional information, e.g. the way the data has been modeled physically, data distribution stats, information on data volumes, data access patterns etc.
Just looking at a query in isolation is not enough to propose performance tweaks. Frequently, you need to look at query patterns over time and the query history to make performance recommendations.
I have listed some examples where looking at the query history can be used to make recommendations to adapt the physical data model design.
- Looking at the table join columns you can determine an appropriate set of distribution keys
- Looking at frequently used column filters in the WHERE, HAVING, or QUALIFY clause can determine recommendations for indexes or sort/cluster keys.
- Queries that frequently aggregate on columns make good candidates for Materialised Views
Using an SQL Parser you can translate the SQL syntax from one SQL dialect to another SQL dialect by mapping the grammar between these.
This is useful for migration projects from one database vendor to another, e.g. you might want to migrate your legacy data warehouse to a cloud data warehouse such as Snowflake.
You can take this one step further and migrate SQL code to an ETL tool. Many ETL tools have an SDK that can be used to programmatically generate transformation operators. Using an SQL Parser you could migrate SQL code to a visual representation in an ETL tool such as Oracle Data Integrator using the SDK of the tool.
The other direction is possible as well. Using the SDK you could generate SQL code from visual ETL transformations and operators.
Various open source SQL Parsers exist, e.g. SQL Parse is an implementation of an SQL parser in Python.
The Uber engineers also built a lookup phase into the parser that looks up information in the information schema such as the columns for SELECT b.* FROM tableb as b. This component resolves column names that are obscured in the SQL itself, e.g. looking at SELECT * in isolation will not tell you the column names that are part of the query. You can augment the parse tree by looking up this information from the database’s information schema.
The diagram from the website outlines the three step process they implemented.
The open source SQL Parsers work for some simple use cases. You can even build your own SQL Parser from scratch.
However, if you need something more robust you should look at a commercial SQL Parser. We will soon release FlowHigh, a commercial online SQL Parser. The first dialect we support will be Snowflake but we will add all of the important SQL dialects over time.
- You need broad support for SQL dialects
- You need support for the long tail of SQL features. These are features that are very specific to a particular dialect, e.g. CONNECT BY on Oracle
- You don’t need to host or administer the SQL Parser. Just access it online via an SDK or an API
- You need product support
- You need custom features or enhancements
Register your interest for FlowHigh the SQL Parser from Sonra and we will let you know once it becomes available.