BigQuery Data Lineage Guide – Features, API, & Limitations

Uli Bethke

Uli has been rocking the data world since 2001. As the Co-founder of Sonra, the data liberation company, he’s on a mission to set data free. Uli doesn’t just talk the talk—he writes the books, leads the communities, and takes the stage as a conference speaker.

Any questions or comments for Uli? Connect with him on LinkedIn.


Published on February 2, 2025

In this post I dive deep into the BigQuery data lineage feature. For those of you who are not familiar with data lineage here is a quick primer.

 Data lineage is the process of tracking and visualising the journey of data as it moves through different systems, transformations, and workflows. It shows where data comes from (source), how it changes along the way (transformations), and where it ends up (destination).

It helps you answering questions like:

  • Where did this data originate?
  • What transformations or calculations were applied to it?
  • Which reports, dashboards, or systems are using this data?
  • What happens if I make a change to my data pipeline and who will be impacted by this change.

In this blog post I will first give you an intro to BigQuery Lineage and how it relates to the GCP Dataplex catalog. I will then get hands on and show you how to set up and enable lineage in BigQuery. Next I will create a data pipeline and I explore the various features of BigQuery lineage using this pipeline as an example. I will then show you how you can access the BigQuery API programmatically using the REST API. I have prepared various code examples in python that will show you how to work with the most common endpoints and the dependencies between them. I finish out the post with some thoughts on the BigQuery lineage limitations and explore scenarios when a third party data lineage tool is a better fit.

If you are in a rush I have compiled the most important points here:

BigQuery lineage is a part of the Dataplex data catalog. Apart from BigQuery, Dataplex also supports lineage for other GCP tools such as CloudComposer.

You can access lineage through a visual diagram or programmatically through an API.

BigQuery doesn’t support column-level lineage yet, which is a big downside compared to Snowflake and Databricks. That said, column lineage is in private preview as of January 2025.

BigQuery lineage does not support impact or root cause analysis.

One thing that really annoys users is how lineage doesn’t show up in real time and can take up to 24 hours to appear. This causes a lot of confusion, and you’ll see plenty of questions about it on forums like Stack Overflow and Reddit.

The visual lineage graph is super clean and easy to use. SQL processes are shown as their own nodes, which makes more sense compared to Databricks, where they’re hidden as edges.

The REST API is pretty straightforward to use, but unlike Databricks or Snowflake, there aren’t any system tables or views to access lineage through SQL.BigQuery does not visualise individual SQL queries. You can use FlowHigh SQL visualiser to visualise SQL code.

Ok. Let’s dive into the world of BigQuery data lineage

BiqQuery Lineage and Dataplex

BigQuery data lineage is a feature of the Google Cloud’s Dataplex data catalog.

Google Cloud’s Dataplex offers a centralized platform to manage, monitor, and govern data across various data architectures including data lakes, data warehouses, data mesh, and data marts.

Some key features of Dataplex:

  • Data Organization and Lifecycle Management: Dataplex introduces constructs such as lakes, zones, and assets to logically organize data in alignment with business domains or units. This abstraction enables policy setting for data access, security, and lifecycle management without necessitating data movement or duplication.
Unified data management framework with data warehousing and governance tools

  • Data discovery and Catalog: Dataplex automates data discovery, classification, and metadata enrichment of structured, semi-structured, and unstructured data, stored in Google Cloud and beyond, with built-in data intelligence.
  • Centralized security and governance: Dataplex enables central policy management, monitoring, and auditing for data authorization and classification, across data silos.
  • Built-in Profiling, Data quality and Lineage: Dataplex automates data quality across distributed data and enables access to data and automatically collects data lineage for BigQuery and other GCP products and services. This is the area we will cover in this blog post focusing on BigQuery.

Besides BigQuery, the Dataplex catalog also offers lineage for these other systems and services on Google Cloud:

  • Cloud Data Fusion: A fully managed, cloud-native data integration service that allows users to build, manage, and monitor data pipelines with a visual, drag-and-drop interface.
  • Cloud Composer: A managed workflow orchestration service based on Apache Airflow, enabling users to schedule and manage data pipelines.
  • Dataproc: A fully managed service for running Apache Spark, Apache Hadoop, and other open-source data processing frameworks.
  • Vertex AI: A machine learning platform on Google Cloud that supports building, deploying, and scaling AI models with tools for data preparation, model training, and MLOps.

If you want to find out more check this series of articles about Dataplex on Medium by a Google employee.

What is BiqQuery Lineage?

Now we know that BigQuery lineage is a feature of Dataplex. But what exactly is it and how is it useful?

BigQuery lineage refers to the ability to track the flow of data through various stages of processing, transformation, and usage. It gives you a detailed map of where data comes from, how it is transformed, and where users consume it. This is important for understanding dependencies in data pipelines.

BigQuery interface showing vehicle sales view and data lineage connections

Let me give you an analogy. Think of BigQuery lineage as tracking the process of baking bread in a bakery, where raw ingredients (data sources) go through various steps (data transformations) before becoming a finished product (reports or analytics).

  • Raw Ingredients (Data Sources): Flour, water, yeast, and salt represent raw data. These are sourced from different suppliers (external systems, CSV files, or APIs).
  • Mixing and Kneading (Data Transformation): The ingredients are combined and processed, similar to how raw data is cleaned, enriched, or joined with other datasets through SQL queries or ETL tools. Each step of transformation builds on the previous one—mixing ingredients transforms them into dough.
  • Distribution (Consumption): The packaged bread is sent to stores or directly to customers, just as processed data is sent to reporting tools, dashboards, or machine learning models and accessed by end users.

Why use BigQuery Lineage

There are a couple of scenarios when data lineage in BigQuery is useful. I have listed a few of them.

  • Debugging and Troubleshooting: When a dashboard shows incorrect results, lineage helps pinpoint the source of the issue. For example, you can trace back to see if the problem lies in a specific dataset or the code for the transformation logic.
  • Impact Analysis: Before modifying a dataset, table, or SQL statement, lineage reveals all downstream dependencies. This prevents unintended disruptions to dashboards, reports, or machine learning workflows that rely on the data.
  • Data Governance and Compliance: In industries with strict compliance requirements (e.g., GDPR or HIPAA), lineage helps demonstrate how PII data flows through the system, who has access to it, and how it is transformed. Data lineage in this scenario brings light to the black box.
  • Optimising Data Pipelines: By visualising the entire data flow, you can identify issues or anomalies in your pipelines.

Accessing data lineage in BigQuery

You can access data lineage using:

BigQuery lineage versus Snowflake and Databricks lineage

Unlike Snowflake or Databricks, which provide lineage insights through system tables that explicitly track dependencies between tables, BigQuery.

BigQuery also does not provide data lineage at the column level.

Bring your own data lineage to BigQuery

The Data Lineage API in Dataplex allows you to manually record lineage information for data sources that are not natively supported by Dataplex.

Bring your own data lineage” means manually integrating and importing lineage metadata that you have independently captured or generated, rather than relying solely on the Dataplex’s native lineage tracking capabilities. This approach allows you to combine lineage information from diverse sources, such as custom scripts, other databases, third-party tools, manual documentation, or logs, into a unified platform.

You can use the DataLineage API to record externally curated data lineage.

OpenLineage

OpenLineage is an open-source framework designed to collect, track, and manage data lineage across various data tools and pipelines. It provides a unified framework for capturing metadata about data flows and transformations,

You can directly import lineage recorded in OpenLineage to Dataplex. This is a less manual effort than using the data lineage API.

Key Lineage Components: Processes, Runs, and Events

When working with data lineage in BigQuery you need to understand some key concepts

Process

This concept is easy to understand. It is a data transformation process such as an SQL INSERT statement. For a full list of supported Processes check the documentation/

Here is an example of a Process

Run

This concept is also straightforward. It is the execution of a Process. In other words it is an instance of a Process, Data lineage is only created when a Process is executed and moves data from sources to targets. Unless a Process is run it does not create data lineage Events.

A Run does not have a visual representation on the lineage diagram but you can access the Run details through the UI.

Event

A data lineage Event is created when a Process is executed in a Run. Events are represented as edges on the lineage diagram that define the dependencies between the source and target tables.

DML code process for vehicle sales analysis with data dependencies

Sources and targets

These are the data sets such as tables that are recorded as Events when an SQL Process such as an INSERT statement is executed as a Run.

Setting up BigQuery lineage

Setup and lineage permissions

Before you can use BigQuery lineage you need to go through the following steps:

  • You need to make sure that you billing is enabled on your GCP project
  • You need to enable the Data Catalog, BigQuery, and data lineage APIs.
  • You need to enable the following roles

Now we are ready. Let’s build a data pipeline and then visualise it in BigQuery,

Creating a sample data pipeline

In this section I build a data pipeline from scratch. If you are not interested in the technical details of building a data pipeline you can skip to the next section where I browse the visual diagram and the lineage of the data pipeline.

I used the following data set as the source for our pipeline car_prices.csv

Set up project and data set

Step 1: Create a new Project

  • In the Google Cloud Console, click on the project drop-down menu at the top of the screen and Click on “New Project.”
Google Cloud project selection interface for DataLineageSQL access

Step 2: Set Up Your Project

  • Enter a project name and Click “Create” to set up your new project.
New project setup interface for efficient project management and organization

Step 3: Navigate to BigQuery

  • Click on your project name in the Cloud Console, navigate to “BigQuery,” and then click on “Create Dataset.
BigQuery Studio interface for data analysis and management tools

Step 4: Fill in Dataset Details

  • Fill in your dataset name (Data set ID), choose a location for your data, and click “Create dataset.”
Create new dataset in Google Cloud BigQuery with multi-regional options

Step 5: Create a Table

  • To load data into your dataset, click on your dataset name and then click on “Create Table.”
Data management interface with queries workflows and external connections options

Step 6: Choose Data Source

  • Choose the source from which you’d like to load data (e.g., upload from local files or connect to other data sources).
Upload and configure file settings for data table creation in the app

SQL Pipeline

Next I run the code to create the data sets and flows in our pipeline.

Create Vehicle Info Table

Create Sales Info Table

Create Vehicle Condition Table

Create premium_vehiclesTable

Create vehicles_below_market View

Create Price Analysis Table

Create Market Analysis View

Create premium_analytics Table

Update Vehicle Info Table for Premium Vehicles

Insert into Premium Vehicles Table

Merge Data into Vehicle Info Table

Delete Records from Premium Vehicles

Update Premium Vehicles

Insert Values Into sales info

Create a Function to Calculate Market Position

Insert Data into Price Analysis Using the Function

Create a Materialized View for Premium Car Sales Summary

That’s it.

In the next section I will browse the lineage graph of the pipeline.

Accessing BigQuery data lineage

Accessing BigQuery Lineage From Dataplex

You can access BigQuery lineage from inside Dataplex.

Google Cloud Dataplex interface for managing data lakes efficiently

Use the search box at the top to search for a table in BigQuery to get lineage and dependencies

Google Cloud Dataplex management interface for vehicle data analysis

This will bring you to the home page of table price_analysis and you can then access the lineage information for this table

Data analysis interface showcasing price analysis schema fields and filters

Here is the lineage graph

Data lineage visualization for price analysis of vehicle datasets

Accessing BigQuery Lineage directly

Alternatively you can browse to the same table’s page in BigQuery explorer and access the lineage graph from there

BigQuery interface showcasing price analysis schema for vehicle data insights

Visualising BigQuery Data Lineage

Graph mode

Let’s have a closer look at what features the visual lineage graph gives us.

I start with the basics first.

When you select a table to view its lineage, the diagram displays its direct neighbours. The selected table, referred to as the anchor node, is shown along with its immediate parent and child nodes. These immediate dependencies are the nodes that are just one step away from the anchor node.

Vehicle data flow diagram highlighting premium car sales analytics integration

In this figure we have a table premium_vehicles.

premium vehicles for luxury driving experience and quality performance

This is the anchor node. The lineage graph shows the upstream and downstream tables that depend on the anchor node table.

If you click on a table node the list of columns for the table is displayed.

Data structure visualization for premium vehicle analytics management

The visualization also contains the SQL statements as separate nodes. In BigQuery speak these are called Processes

search icon with arrows for easy navigation and accessibility

This is nice and a much better implementation than on Databricks where the SQL is represented and obfuscated as edges, which is not an accurate way of representing these objects.

By clicking on the query node BigQuery will display the actual SQL code

SQL process flow diagram for premium vehicle data analysis and management

The code is nicely formatted and very easy to read.

For each query you can also display executions aka Runs

Premium vehicles data lineage overview showcasing analytics processes

This information is helpful for understanding when the query was last run.

Data lineage not showing for your tables and SQL?

One common source of frustration for users of BigQuery lineage is that lineage does not show up.

The root cause for this behaviour is that it takes between 30 minutes to 24 hours for the lineage information to register inside the Dataplex data catalog.

There are other reasons why lineage may not show:

  • The Data Lineage API is not enabled in either the active project or the compute project.
  • Additionally, you may lack the Data Lineage Viewer role (roles/datalineage.viewer) in one or both of these projects.

List mode

In list mode you get the downstream lineage with parents (source) and children (target) and the depth.

Data lineage mapping for premium vehicles analytics visualization tool

In this example, the anchor table premium_vehicles has three downstream tables, as shown at Depth Level 1. One of these tables, premium_analytics, acts as a parent table for an additional table further downstream at Depth level 2.

Data Lineage API in BigQuery

Overview of the Data Lineage API

GCP provides a REST API for data lineage. You can use the API to programmatically create lineage or to retrieve information about lineage.

The most important REST resources are the endpoints for Processes, Runs, and Lineage Events.

Here is a list of the common methods for each endpoint.

  • create: Initiates a new process.
  • delete: Removes a process by its specified name.
  • get: Retrieves details of a specific process.
  • list: Enumerates processes within a given project and location.
  • patch: Modifies an existing process.

Using the BigQuery lineage API

I have put together some Python code that queries the lineage API.

Any references to object IDs such as the project ID I used have been obfuscated.

Let’s start with the pre-requisites.

Pre-requisites

List processes

The method list_processes_simple retrieves and displays detailed information about lineage processes within a specified Google Cloud project (9993XXXXX963) and region (eu). The output includes Full names, display names,

This gave me 8 processes of type Query.

Available processes list for project ID 99930 in EU region

You can also use the API Explorer’s user interface to query the REST API and list Processes. This approach is particularly useful for testing purposes.

API documentation page showing data retrieval methods and parameters

Retrieve Specific Lineage Process Details

The method get_process retrieves and displays detailed information about lineage processes within a specified Google Cloud project (999XXXXXX963), region (eu) and process_id = “sh-5bb260cfXXXXXXX126fd32ae1d16479”. I got the process_id in the previous step where I listed all of our processes.

The output includes the full name, display name, and the bigquery_job_id.

BigQuery process details showcasing query attributes and job ID

List Runs for a Process

Next I list the Runs for the Process. The method list_runs_for_process lists all runs associated with a specific lineage process. It retrieves the runs based on the provided process name.

We see that only 1 Run is associated with this Process.

Retrieve Specific Run Details

This method get_run retrieves detailed information about a specific run associated with a lineage process. It outputs the run’s name, display name, state, start time, and end time.

Run details showcasing project execution timestamps and status updates

List Event Lineage

The method list_event_lineage function lists all lineage events associated with a specific lineage process. It provides details about each event’s source and target.

Data lineage events showcasing source and target information for projects

BigQuery Lineage Limitations: 3rd party lineage tools

BigQuery has three significant limitations

  • No Column Lineage: BigQuery tracks table-level dependencies only, making it unsuitable for detailed transformation or column-level analysis. Without column lineage you will not be able to perform impact or root cause analysis. Column lineage is currently in private preview.
User comment about Column Level Lineage feature release for private preview

  • Delayed Processing (up to 24 hours): Lineage updates are delayed by up to 24 hours, limiting real-time visibility into data flows and transformations.
  • Limited Retention (30 days): Lineage metadata is stored for only 30 days, restricting historical tracking and long-term auditing.

Also consider the following limitations

  • BigQuery does not automatically track lineage across third party tools such as ETL or BI tools.
  • While you can import lineage from other systems this is a manual process in the sense that you need to write code to achieve it.
  • BigQuery shows the SQL Processes that move data as separate nodes and not as edges. This is great but it does not let you drill into the SQL code and visualise that further. Inline views, subqueries, and complex SQL logic are critical components. End-to-end lineage must include these to give a complete picture of data flow. Our own tool FlowHigh SQL visualiser can visualise your SQL query code for debugging purposes.

Frequently Asked Questions (FAQ)

Can I use the BigQuery lineage feature with dbt and other ETL tools?

Yes. As dbt generates SQL that is pushed down into BigQuery it will be displayed in the lineage graph.

How long does it take for data lineage to show up in the lineage graph?

It can take between 30 minutes and 24 hours for lineage to show up. This is one of the various limitations of BigQuery lineage.

Why don’t my tables or SQL statements appear in the visual lineage graph?

There are a few possible reasons for this, but the most common one is timing. BigQuery may take some time to record and process lineage information, so it can take up to 24 hours for the data to appear in the visual graph.

For a detailed list of potential issues and troubleshooting steps, Google provides a comprehensive guide on common reasons why lineage might not display in BigQuery.

Can lineage track external systems feeding into BigQuery?

Yes. You can use the BigQuery API to programmatically create and import data lineage from external sources.

Does BigQuery support both table and column lineage?

As of January 2025, BigQuery does not support data lineage at the column level.

You can check how BigQuery stacks up against lineage on Snowflake and lineage on Databricks.

Uli Bethke

About the author:

Uli Bethke

Co-founder of Sonra

Uli has been rocking the data world since 2001. As the Co-founder of Sonra, the data liberation company, he’s on a mission to set data free. Uli doesn’t just talk the talk—he writes the books, leads the communities, and takes the stage as a conference speaker.

Any questions or comments for Uli? Connect with him on LinkedIn.