Databricks Data Lineage – API, Tables, Unity Catalog

Uli Bethke



Published on January 6, 2025

In this blog post, I’m taking the Databricks data lineage feature for a test ride. Let’s see what it can do and where you might need to rely on a dedicated data lineage tool.

First, I’ll start with the basics. I’ll give you a high-level overview of what data lineage in Databricks actually is and why it’s important. Then, I’ll zoom out a bit and put it into context by looking at how it fits into the bigger picture – specifically, the Unity Catalog.

Once we’ve set the stage, I’ll walk you through the three ways you can access data lineage in Databricks. The first is via System Tables, where you can query lineage data directly. Next is the visual diagram, which gives you an interactive map of your data pipelines—think of it as a bird’s-eye view of your workflows. Finally, there’s the RESTful API, perfect for scenarios where you need automation and programmatic access.

In the last part of this post, I’ll get into the nitty-gritty details and limitations in Databricks data lineage. It’s not always enough for every use case. I’ll highlight scenarios where you will need a third-party data lineage solution to hit your goals.

So, let’s dive in and see what Databricks’ lineage feature is made of!

For those of you in a mad rush, I have compiled the key takeaways.

Databricks Data Lineage in a nutshell

Data Lineage in Databricks is part of the Databricks Unity Catalog. Unity Catalog is a metadata store used to manage metadata in Databricks.

Databricks gives you three options to access data lineage:

  • A REST API provides programmatic access to column- and table-level lineage.
  • Two System tables provide programmatic access to column and table data lineage through SQL.
  • A visual diagram lets you browse data lineage based on the neighbourhood view paradigm.

As of December 2024, I have noticed several limitations in Databricks’ data lineage implementation.

  • The API is restricted to providing information about immediate upstream and downstream nodes. For example: If you query the lineage of a table, the API will only return the direct dependencies (parents and children) but not the complete lineage chain. This limitation makes it challenging to get a comprehensive view of the entire lineage graph programmatically. You would need to repeatedly query the API to build a full picture.
  • To overcome the API’s one-level limitation, you must manually query System Tables to retrieve deeper dependencies. However, Databricks lacks support for recursive SQL or Common Table Expressions (CTEs), forcing you to write complex, iterative scripts. This significantly increases manual effort and the risk of errors.
  • The current implementation does not capture lineage for UPDATE and DELETE operations, leaving gaps in tracking dependencies and limiting the ability to understand the full data pipeline.
  • The visual diagram provides an incomplete picture by omitting the actual logic or code behind data transformations, reducing its value for debugging or documentation.
  • To overcome these limitations, you will need to turn to a dedicated data lineage tool.
  • Databricks lineage does not track the dependencies inside individual SQL statements, such as nested subqueries or database views. An SQL query visualiser can help you track intra-query dependencies.

Let’s start with a quick intro to the concept of data lineage.

What is data lineage?

Hundreds of articles have been written about data lineage. Rather than adding more text to the pile, I will paint a picture for you. Or even better, a map. An OpenStreetMap (OSM). Just like OSM helps you find your way around when exploring a city or find the next petrol station on a road trip, data lineage helps you find your way around a data pipeline. It organises your data flows and data nodes into dependencies and shows you how everything hangs together.

Here’s a comparison table highlighting how a map, like OpenStreetMap, relates to data lineage.

Purpose

| Map | Data Lineage |
| --- | --- |
| Helps individuals understand where things are and how to get from one location to another. | Helps organisations understand where data comes from, how it moves, and where it ends up. |
| Provides visibility into physical routes and pathways. | Provides visibility into data transformations, workflows, and dependencies. |

Key Components

| Map Components | Data Lineage Components |
| --- | --- |
| Locations: POIs, cities, landmarks, etc. | Data stores: tables, files, data sets |
| Paths/Routes: roads, highways | Data Flows: the movement of data, e.g. SQL code to move data |
| Waypoints: stops along a route | Transformations: intermediate steps where data is cleaned, aggregated, or processed |
| Legends/Keys: explain symbols | Metadata: describes the meaning, changes, and relationships of the data |

Data lineage and the Unity Catalog

Data lineage in Databricks is part of the Unity Catalog.

Unity Catalog covers the following use cases:

  • Enterprise Data Governance: Enforce data access policies and maintain compliance with regulations.
  • Data Discovery: Enable data teams to discover and explore datasets.
  • Data Sharing: Share datasets securely with third parties using Delta Sharing.
  • Auditability: Track data transformations and usage for audits and troubleshooting.

Unity Catalog is mainly used for managing data assets inside Databricks, even though it is possible to integrate with foreign catalogs.

Microsoft Purview vs Unity Catalog

If you need to manage metadata across multiple data platforms, BI tools, and databases, Microsoft Purview can complement the Unity Catalog, as it provides much wider support for managing lineage and metadata across multiple tools and platforms.

Open Source (OSS) Unity Catalog

It’s important to distinguish between the open-source Unity Catalog and the Databricks Unity Catalog, as they are significantly different.

The Databricks Unity Catalog focuses on data governance and includes features like data lineage. In contrast, the open-source Unity Catalog acts primarily as a metastore for managing various table formats, such as Iceberg and Delta Tables. Unlike the Databricks version, the open-source Unity Catalog does not offer data governance capabilities.

Data lineage in Databricks

Now that you know the most important bits about data lineage and the wider context of the Unity Catalog, let’s explore data lineage inside the Databricks platform. Data engineers, data analysts, and data scientists use lineage to track the dependencies of data sets (think tables) and data flows (think SQL) in a data pipeline.

Databricks announced its lineage feature in June 2022 and it became GA in December 2022.

Together with Snowflake, Databricks is one of the few data platforms with built-in native support for data lineage and a data catalog.

Data lineage works for standard objects such as tables but can also be used to track the dependencies of ML models.

Databricks tracks the lineage inside the Query Editor, Notebooks, Workflows, Dashboards and Pipelines.

Development tools that trace data lineage in Databricks

  • Query Editor: An editor for writing and executing SQL queries to explore and analyse data in Databricks.
  • Notebooks: Interactive documents that combine code, visualisations, and markdown, supporting multiple programming languages like Python, Scala, SQL, and R.
  • Workflows: A feature for orchestrating and automating tasks, such as running jobs, pipelines, or notebooks in a scheduled or triggered manner.
  • Dashboards: Visual representations of data and insights created from notebook outputs or queries for sharing and monitoring.
  • Pipelines: Tools for building and managing ETL workflows using Delta Live Tables to ensure efficient, reliable data transformation.

How can I access data lineage in Databricks?

Data lineage in Databricks comes in three flavours.

  • API: Provides programmatic access to lineage information, allowing users to extract and integrate data lineage details into custom tools or workflows.
  • System Tables: Stores lineage metadata in Databricks system tables, enabling SQL-based queries for tracking data flow and transformations.
  • Visual Graph: An interactive, visual representation of data lineage that displays the relationships and flow of data across notebooks, jobs, and pipelines.

What type of DML and DDL is in scope for lineage?

DML and DDL coverage depends on how you access data lineage. The behaviour differs between the API, System Tables and the visual data lineage DAG.

DML

Let’s begin with the limitations and check the DML operations that are not supported by any of the three types of data lineage: the API, the System Tables, or the visual lineage graph.

  • UPDATE: Lineage information is not captured for UPDATE operations.
  • DELETE: Similar to updates, lineage is not available for DELETE operations.
  • INSERT VALUES: Lineage information is not captured for INSERT VALUES operations.

I think this is a significant limitation, as these are common operations in a data pipeline. The impact may be a broken or at least incomplete lineage graph.

What about MERGE and SELECT?

  • MERGE: Lineage information is not captured for MERGE operations by default. However, you can turn on lineage capture for MERGE operations by setting the Spark property spark.databricks.dataLineage.mergeIntoV2Enabled to true, as shown in the sketch after this list. Based on the documentation this will have a performance impact. I did not test the exact nature of the impact, but it could be significant.
  • SELECT: The SELECT statement is only supported when you are using the System Tables to query data lineage. It does not show up in the lineage API or the lineage visualisation.
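
If you want to switch MERGE capture on, a minimal sketch looks like this. It assumes you are running inside a Databricks notebook, where the spark session is predefined; the property name is the one from the documentation quoted above.

```python
# Enable lineage capture for MERGE operations for the current session.
# Note: per the documentation this may come with a performance cost.
spark.conf.set("spark.databricks.dataLineage.mergeIntoV2Enabled", "true")
```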

Databricks has full data lineage support for CREATE TABLE AS (CTAS) and INSERT INTO SELECT (IIS).
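
For illustration, here are the two statement shapes that are fully covered, sketched with hypothetical table names.

```python
# CTAS: lineage links sales.orders (source) to sales.daily_revenue (target).
spark.sql("""
    CREATE TABLE sales.daily_revenue AS
    SELECT order_date, SUM(amount) AS revenue
    FROM sales.orders
    GROUP BY order_date
""")

# INSERT INTO SELECT: lineage links sales.late_orders to sales.daily_revenue.
spark.sql("""
    INSERT INTO sales.daily_revenue
    SELECT order_date, SUM(amount) AS revenue
    FROM sales.late_orders
    GROUP BY order_date
""")
```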

DDL

When it comes to DDL, standard objects such as tables, UDFs, materialised views, Delta Live Tables, and views are included in data lineage. I noticed one small issue worth mentioning: you can’t create a Materialised View inside a Notebook, and as a result no lineage is displayed for it.

Visual data lineage graph

In Databricks, you can visualise the DAG (Directed Acyclic Graph) of your data pipelines directly in your browser.

When it comes to visualising a pipeline or DAG, you have two common choices.

  1. View the entire pipeline at once, or
  2. Focus on a specific part of the pipeline and explore step by step.

Databricks uses the second method, known as the neighbourhood view, to help you focus on what’s most relevant. You start by selecting a dataset, such as a table. This dataset becomes the anchor node of the visualisation. Databricks then shows you the immediate upstream and downstream nodes connected to this dataset.

From there, you can interact with the pipeline by expanding nodes step by step. You can trace data dependencies, navigate upstream to see where data comes from, or move downstream to understand where it flows next. This approach ensures you only see the parts of the pipeline you’re interested in, keeping the visualisation manageable.

Data lineage visualisation Track upstream anchor and downstream nodes with edges

Neighbourhood view for visual data lineage graph

Both Snowflake and Databricks use the neighbourhood view for visualising data lineage. Here is a list of the pros and cons of the neighbourhood view when exploring a data pipeline.

PROS
  • Simplifies Complexity: Focuses on immediate connections. Makes it easier to interpret large graphs.
  • Scalable: Handles large datasets without overwhelming the user or system.
  • Efficient: Faster rendering and lower resource usage since only part of the graph is displayed.
  • Clear Dependencies: Highlights upstream and downstream relationships for quick debugging or analysis.
  • Interactive Navigation: Allows step-by-step exploration, reducing cognitive load.
CONS
  • Limited Context: Does not show the entire graph, making it hard to understand the overall structure.
  • Time-Consuming: Requires repeated expansions to explore multiple levels or nodes.
  • Anchor Node Dependence: Insights rely heavily on the chosen starting point.
  • Challenging Analysis: Global patterns and metrics like connectivity or bottlenecks are harder to identify.

Accessing the data lineage graph

I have compiled step-by-step instructions for accessing the visual data lineage in Databricks.

Step 1: Click on Catalog

Databricks interface showing catalog workflows and SQL options menu

Step 2: Select the Schema where the Pipeline is built (e.g. Sales)

Databricks catalog interface showing Saurabh Guptas cluster organization files

Step 3: Select the Table for Lineage, e.g. Supermarket_Sales

Databricks interface showcasing supermarket sales data overview

Step 4: Click on the Lineage tab

data catalog overview supermarket sales table and its features

Step 5: Click on the See Lineage Graph button

data lineage interface for supermarket sales catalog exploration

Using the data lineage visualisation

Step 1: Open the Lineage Graph

database schema for sales and customer data analysis

Step 2: Click on the (+) Sign to extend the Lineage to see connections between tables

database tables and relationships for sales data analysis and reporting

I came across a particularly annoying limitation in the way the data lineage graph handles the expand (+) icon next to dataset nodes. The purpose of this icon is to allow users to expand the diagram and show upstream or downstream nodes for a given dataset. However, in many cases, clicking the icon led to… nothing. It simply didn’t do anything. This creates confusion and disrupts the user experience. The presence of the icon suggests that additional nodes exist. A more intuitive design would hide the icon entirely when there are no upstream or downstream nodes to display, preventing unnecessary clicks and improving usability.

From what I can tell, the issue lies in how Databricks calculates these dependencies. They’re only resolved at runtime when you click the icon, which also explains the noticeable delay when expanding a node.

Databricks displays a plus icon beside the sales_transactions table.

Node expansion view: Visualise table details with column types and expandable options

Nothing happens when clicking on the plus icon to expand the node downstream.

Table node view Displays table details with column names types and expandable options

One of the more frustrating things I’ve come across is the inability to collapse a node after expanding it. Once a node is expanded, it stays that way, leaving the graph cluttered and harder to navigate—especially in large, complex lineage diagrams. This, combined with the earlier issue of the unresponsive expand (+) icon, makes working with lineage graphs far more tedious than it should be. It feels like these limitations turn what should be a powerful visualisation tool into a source of frustration.

Expanding table sales downstream

Upstream node view Visualise table relationships column names and data types

No minus icon to collapse

Lineage view Trace data flow between interconnected tables with column details

Viewing column lineage

You can view column-level lineage in the data pipeline visualisation, which sounds promising at first. However, I ran into a major limitation: the lineage only extends one level up and one level down from the selected column.

In my opinion, this constraint significantly undermines one of the most important use cases for column-level lineage: impact and root cause analysis. To be truly effective, such an analysis needs to trace the lineage all the way back to the roots when going upstream or follow it down to the leaves when going downstream. Without this capability, the feature feels incomplete, more like a box-ticking exercise to claim column-level lineage support than a helpful feature for real-world problems.

Anyway, here is how you view column-level lineage.

Step 1: Click on the Show more columns

data table displaying invoice ID and branch for supermarket sales

Step 2: Select the column for which lineage is required, e.g. invoice_id. As you can see, arrows have been added for both the upstream and downstream columns of column invoice_id.

data lineage diagram showing sales database tables and relationships

Exploring node details

When you click on a node in the lineage graph, you get some additional information about the table.

On the right-hand side, a new pane opens that shows the columns that are part of the table under Column Details:

database schema diagram for supermarket sales data management system

When you click on the Lineage tab, you uncover detailed information about where the table appears. Databricks lists all related objects, such as Notebooks, Workflows, and Query Editor windows. Arrows next to each object show whether the table acts as an upstream source or a downstream target.

Lineage details View table connections SQL pipelines and data flow for analysis

data lineage interface displaying tables and notebooks in analytics platform

You can also filter Lineage by:

  • Upstream Connections
  • Downstream Connections
  • All Connections
Lineage view Explore upstream connections queries and data flow for sales transactions

Filtering by Upstream connections shows only the immediate upstream dependencies, again limited to one hop.

Comprehensive data lineage Visualise upstream downstream nodes and table connections

Exploring edge details

In data lineage visualisations, the line connecting upstream and downstream nodes—such as tables or datasets—is called an edge. This term comes from graph theory, where it represents the connection or relationship between two entities, known as nodes. In the context of data pipelines, an edge shows how data flows or how one dataset depends on another.

When you hover over an edge, you can see details about the most recent activity between the connected nodes. Databricks highlights the last SQL operation executed on the dependent dataset and displays it for reference.

Data activity view Highlighting recent activity timestamps across connected tables

When you click on an edge connecting two nodes, you see more information about the Lineage Connection, such as the Source, the Target, and where the SQL was used.

In my example the source is table sales_data. The target is table sales and the SQL was captured from a Notebook.

Lineage connection details View source target last activity and captured lineage paths

What I found frustrating, though, is that the actual SQL code is missing from the lineage view. Sure, you can click on the link to the Notebook to find it, but this forces you into a context switch. A new window pops up with the Notebook, and then you have to sift through the code to locate the specific SQL you’re after. This process feels clunky and unnecessary. Imagine how much smoother it would be if the lineage view simply displayed the SQL code that loads the target from the source(s) right there. No extra steps, no new windows, just the information you need.

Data lineage in System tables

You can dive into Databricks data lineage programmatically through the lineage System tables. Using these handy tables, you can access both table-level and column-level lineage data:

  • system.access.table_lineage
  • system.access.column_lineage

Important:

Before you can use the System lineage tables, you need to enable them; otherwise they do not show up in the Catalog.

Enabling access to lineage tables

Use the following command to enable access to lineage System tables.

You need to provide the PAT Token, Workspace and Metastore ID together with the name of the schema (access) where the lineage tables are located.
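
The original command was shown as a screenshot. Here is a minimal sketch of the same call against the Unity Catalog system schemas endpoint; the host, metastore ID, and token values are placeholders you need to substitute with your own.

```python
import requests

# Placeholders - substitute your own workspace values.
HOST = "https://<your-workspace>.cloud.databricks.com"
METASTORE_ID = "<metastore-id>"
TOKEN = "<personal-access-token>"

# Enable the 'access' system schema, which contains the lineage tables.
resp = requests.put(
    f"{HOST}/api/2.0/unity-catalog/metastores/{METASTORE_ID}/systemschemas/access",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
```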

Once you have enabled access, the lineage tables will appear in the Catalog.

Catalog view Navigate organisational systems databases and schemas for query access

Lineage tables and query_history

To unlock even more insights, you can join these lineage system tables with the query_history system table to pinpoint the exact SQL code that was executed.

Just connect the statement_id in query_history to the entity_run_id in table_lineage or column_lineage, and voilà… you’ve got the full picture of your data’s journey, down to the query that made it happen!
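
Here is a minimal sketch of that join, assuming the query history System table is exposed as system.query.history in your workspace.

```python
# Recover the SQL text behind each lineage record by joining
# table_lineage to the query history table. The join only matches
# entities whose entity_run_id is a SQL statement id (explained below).
lineage_with_sql = spark.sql("""
    SELECT
        tl.source_table_full_name,
        tl.target_table_full_name,
        qh.statement_text
    FROM system.access.table_lineage AS tl
    JOIN system.query.history AS qh
      ON tl.entity_run_id = qh.statement_id
""")
lineage_with_sql.show(truncate=False)
```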

entity_run_id is an ID to describe the unique run of the entity, or NULL. This differs for each entity type:

  • Notebook: command_run_id
  • Job: job_run_id
  • Databricks SQL query: query_run_id
  • Dashboard: query_run_id
  • Legacy dashboard: query_run_id
  • Pipeline: pipeline_update_id
  • If entity_type is NULL, entity_run_id is NULL.

Not only for Writes but also for Reads

Unlike the visual lineage graph or the API, System tables also capture data lineage for read-only queries such as a SELECT statement.

Table lineage

System table table_lineage stores the table-level dependencies of SQL queries.

You can join this table to query_history to get the full text of your SQL statement.
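
A simple query, with a hypothetical table name, lists the direct (one-hop) upstream tables of a given target:

```python
# Direct upstream tables feeding main.sales.supermarket_sales.
spark.sql("""
    SELECT DISTINCT source_table_full_name
    FROM system.access.table_lineage
    WHERE target_table_full_name = 'main.sales.supermarket_sales'
      AND source_table_full_name IS NOT NULL
""").show(truncate=False)
```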

Tabular data view Analyse detailed columns and rows for database query operations

Column lineage

System table column_lineage stores the column-level dependencies of SQL queries.

You can join column_lineage to System table query_history to get to the source code of the SQL statement.

Database table view Inspect query results with detailed columns and execution statuses

In this particular scenario, we filter by statement_id on the following SELECT query:
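
The query itself was shown as a screenshot; a sketch of the filter looks like this, with a placeholder for the statement id of the SELECT * FROM customers query.

```python
# Column-level lineage records produced by a single statement.
# '<statement-id>' is a placeholder for the id of the
# SELECT * FROM customers query from the example.
spark.sql("""
    SELECT source_table_full_name, source_column_name
    FROM system.access.column_lineage
    WHERE entity_run_id = '<statement-id>'
""").show(truncate=False)
```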

As you can see, Databricks looks up the asterisk (*) against the metadata to get the full list of column names for table customers.

Query results table View raw data with columns statement types and source details

Some limitations about System lineage tables

Databricks makes it easy to access data lineage for both READ (DQL like SELECT) and WRITE (DML like INSERT, CTAS) operations, which is a big win for understanding how data flows. Even better, the lineage comes in a structured format, so you don’t have to deal with parsing JSON arrays to get what you need.

But let’s talk about the downsides. Navigating upstream or downstream through the dependency chain in your data pipeline requires far too much manual effort. You will need to iterate over the System tables hop by hop yourself; it’s a 100% manual effort. A much better approach would be an API that gives you the lineage upstream to the root nodes or downstream all the way to the leaf nodes. If you’re trying to perform impact analysis or root cause investigations, this process quickly becomes a headache.

And it doesn’t stop there. Databricks doesn’t support recursive SQL, such as recursive CTEs, making it even harder to automate these lineage queries. It’s a lot like the limitations in the lineage API (more on that in a minute), where you can only see one level up or down for a table or column.
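
To make the manual effort concrete, here is a minimal iterative sketch that walks the upstream dependency chain hop by hop in Python, standing in for the recursive CTE that Databricks SQL lacks. It assumes the system.access.table_lineage schema described above and a hypothetical starting table.

```python
# Collect all upstream ancestors of a table, one hop per iteration,
# because system.access.table_lineage only stores direct edges.
def upstream_closure(start_table: str) -> set[str]:
    seen = {start_table}
    frontier = {start_table}
    while frontier:
        quoted = ", ".join(f"'{t}'" for t in frontier)
        rows = spark.sql(f"""
            SELECT DISTINCT source_table_full_name
            FROM system.access.table_lineage
            WHERE target_table_full_name IN ({quoted})
              AND source_table_full_name IS NOT NULL
        """).collect()
        frontier = {r.source_table_full_name for r in rows} - seen
        seen |= frontier
    return seen - {start_table}

print(upstream_closure("main.sales.supermarket_sales"))
```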

Data lineage REST API

Some regions don’t have support for System tables. For this scenario, or where you prefer an API over System Tables, you can use the Databricks REST API for lineage.

Just like the lineage System tables, the REST API also supports table and column level lineage.

You submit a table or column name to the API and it will give you the tables or columns one level up and one level down.

For table lineage you use the following command:
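
The original showed this as a screenshot. Here is a sketch of the call to the table lineage endpoint using Python’s requests, with placeholder host and token values:

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

# Fetch one-hop table lineage for a (hypothetical) table.
resp = requests.get(
    f"{HOST}/api/2.0/lineage-tracking/table-lineage",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "table_name": "main.sales.product_sales",
        "include_entity_lineage": True,
    },
)
print(resp.json())
```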

For column lineage you use the following command:
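
Again a sketch, reusing HOST and TOKEN from the previous snippet:

```python
# Fetch one-hop column lineage for a (hypothetical) column.
resp = requests.get(
    f"{HOST}/api/2.0/lineage-tracking/column-lineage",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "table_name": "main.sales.products",
        "column_name": "product_line",
    },
)
print(resp.json())
```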

Explanation:

  • Databricks access token: A unique token used for authentication with the Databricks REST API.
  • Databricks host: The base URL of your Databricks workspace (e.g., https://<your-databricks-instance>.cloud.databricks.com).
  • <catalog>.<schema>.<table>: The full name of the table for which lineage is being tracked, including catalog, schema, and table names.
  • include_entity_lineage: If you set this to true, additional lineage information will be displayed.
  • column_name: The name of the anchor column for which to retrieve data lineage.

Limitations

You can’t control how many levels upstream or downstream to explore, which is a major limitation 😟. In practice, this forces you to create your own solution to recursively navigate the dependency graph 😭😱 via the provided API.
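
As a sketch of what such a workaround might look like, the loop below walks upstream hop by hop through the table lineage endpoint, reusing HOST and TOKEN from the earlier snippets. The response shape (an upstreams list containing a tableInfo object) is an assumption based on the API output discussed below; verify it against your workspace.

```python
# Breadth-first upstream traversal over the one-hop lineage API.
def api_upstream_closure(table: str) -> set[str]:
    seen = {table}
    frontier = {table}
    while frontier:
        nxt = set()
        for t in frontier:
            resp = requests.get(
                f"{HOST}/api/2.0/lineage-tracking/table-lineage",
                headers={"Authorization": f"Bearer {TOKEN}"},
                json={"table_name": t},
            )
            for up in resp.json().get("upstreams", []):
                info = up.get("tableInfo", {})
                parts = [info.get("catalog_name"), info.get("schema_name"), info.get("name")]
                if all(parts):
                    name = ".".join(parts)
                    if name not in seen:
                        nxt.add(name)
        seen |= nxt
        frontier = nxt
    return seen - {table}
```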

Examples for table lineage API

Lineage tracking for table Product_Sales

Output:

As you can see, you get the output for both upstream and downstream tables. You only get upstream and downstream dependencies for direct neighbours.

In this case we have one upstream and one downstream table each.

For each downstream table you also get information about the Query and Notebook where the table occurred.

Examples for column lineage API

Lineage tracking for column Product_line in table products

Output:

The output shows that we have one column upstream and three columns downstream.

Looking at the visual lineage diagram confirms the correctness.

data lineage diagram showcasing sales data product sales and customers

The output does not include information about notebooks or queries. To retrieve this, you need to make a separate API call for the table containing the column product_line.

Data lineage and foreign catalogs

Foreign catalogs in Databricks allow you to access external metadata and data resources that reside outside of the Unity Catalog. They act as a bridge to external data catalogs or systems. Here are some examples of supported foreign catalogs:

  • MySQL
  • PostgreSQL
  • Amazon Redshift
  • Snowflake
  • Microsoft SQL Server
  • Azure Synapse

Foreign catalogs can be used for query federation, which means that you can run queries against one or more foreign catalogs and then stitch the results together inside Databricks.

How is this relevant for data lineage?

While the documentation mentions that you can retrieve data lineage from foreign catalogs, it does not give any detailed steps on how to achieve this.

In theory you can bring your own data lineage from supported foreign catalogs to Databricks. However, it is unclear how this works in practice.

Data lineage for ML pipelines

Data lineage is not only available for standard data pipelines but also for ML pipelines and ML models you build.

To enable lineage tracking for models, you need to log your models with the method FeatureEngineeringClient.log_model. Once done, Databricks automatically captures and displays the lineage of feature tables and Python UDFs used in model computations.
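
A minimal sketch of such a logging call, assuming a scikit-learn model and a training set created with the Feature Engineering client; the names and the registered model path are illustrative.

```python
from databricks.feature_engineering import FeatureEngineeringClient
import mlflow

fe = FeatureEngineeringClient()

# Logging via the Feature Engineering client (rather than plain MLflow)
# is what lets Databricks capture feature-table lineage for the model.
# `model` and `training_set` are assumed to exist already.
fe.log_model(
    model=model,
    artifact_path="wine_model",
    flavor=mlflow.sklearn,
    training_set=training_set,
    registered_model_name="main.ml.wine_model",
)
```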

Lineage Visualisation: You can view the lineage of feature tables, models, or functions by navigating to the corresponding page in Catalog Explorer and selecting the Lineage tab. This gives you a view of the data flow and the dependencies.

Databricks catalog interface displaying personal organization structure

ML lineage is also tracked for upstream tables that feed into the model.

wine database model table with version and column details displayed

When do you need a third-party data lineage tool for Databricks?

I will conclude this blog post with the limitations you will encounter in Databricks data lineage.

Databricks meets some basic requirements around data lineage. However, the features on offer are rather limited. Most companies will have more complex requirements than what Databricks has to offer.

As an example, a significant limitation is that DELETE and UPDATE operations are not handled. This means that you will be missing pieces of your lineage graph.

Here are some more reasons you might need a third-party data lineage tool that goes beyond the basics on Databricks.

1. Full Pipeline Overview

  • You need a unique list of all data pipelines in your environment, not just the neighbourhood view.
  • Neighbourhood views only show dependencies for a selected table, leaving you in the dark about the big picture.

2. Comprehensive SQL Visibility

  • It’s not enough to see dependencies—you need to view the SQL that loads each node, such as a table.
  • A full pipeline visualisation is essential, going beyond isolated anchor nodes or objects.

3. Automated Change Management

  • True impact analysis involves more than just identifying dependencies.
  • Example: Adding a column to a table may not cause immediate failures, but downstream processes like SELECT * queries could inadvertently pull unexpected data, leading to inaccuracies.
  • Proper analysis traces the ripple effects, ensuring ETL jobs, views, and reports are updated to handle changes.

4. Multi-Hop Dependency Analysis

  • Databricks stops at one hop upstream or downstream. You need a tool that goes deeper, automating recursion through the entire pipeline hierarchy.

5. Indirect Lineage Tracking

  • Indirect lineage captures how changes, like modifying a WHERE clause or GROUP BY, affect downstream processes indirectly:
    • Filtering rows in a WHERE clause could skew reports relying on this data.
    • Adjusting a GROUP BY could alter result structures, causing failures or unexpected results downstream.

6. Cross-Platform Lineage

  • Your organisation uses multiple databases, tools, and technologies, and you need to trace lineage across all of them—not just Databricks.

7. Non-Database Technologies

  • You need to track lineage in ETL tools, BI dashboards, Excel files, and even custom Java code.

8. SQL Query Depth

  • Inline views, subqueries, and complex SQL logic are critical components. End-to-end lineage must include these to give a complete picture of data flow. Our own tool FlowHigh SQL visualiser can visualise your SQL query code.
SQL code snippet with complex joins and data visualization layout

About the author:

Uli Bethke

Co-founder of Sonra

Uli has been rocking the data world since 2001. As the Co-founder of Sonra, the data liberation company, he’s on a mission to set data free. Uli doesn’t just talk the talk—he writes the books, leads the communities, and takes the stage as a conference speaker.

Any questions or comments for Uli? Connect with him on LinkedIn.