Skip to content

Tutorial: Exploring Data Lineage

Learn how to trace data dependencies and understand data flow through your organization.

Prerequisites:

What You Will Learn

  • How to trace upstream dependencies (where data comes from)
  • How to discover downstream consumers (who uses this data)
  • How to control lineage depth
  • How to interpret lineage results

Understanding Lineage

Data lineage shows the flow of data through your systems:

flowchart LR
    subgraph Upstream
        A[Raw Events] --> B[Staging Table]
    end
    B --> C[Your Dataset]
    subgraph Downstream
        C --> D[Dashboard]
        C --> E[ML Model]
    end
  • Upstream: Data sources that feed into your dataset
  • Downstream: Systems and reports that consume your dataset

Step 1: Find a Dataset

First, find a dataset to explore. Ask:

"Search for customer datasets in DataHub"

Pick a dataset from the results. For this tutorial, we will use: urn:li:dataset:(urn:li:dataPlatform:snowflake,prod.analytics.customer_metrics,PROD)

Step 2: Get Upstream Lineage

Discover where the data comes from. Ask:

"What are the upstream dependencies for the customer_metrics dataset?"

The AI uses datahub_get_lineage with direction "UPSTREAM":

{
  "urn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,prod.analytics.customer_metrics,PROD)",
  "upstream": [
    {
      "urn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,prod.sales.customers,PROD)",
      "name": "customers",
      "type": "DATASET",
      "platform": "snowflake"
    },
    {
      "urn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,prod.sales.orders,PROD)",
      "name": "orders",
      "type": "DATASET",
      "platform": "snowflake"
    }
  ],
  "downstream": []
}

This tells you that customer_metrics is built from the customers and orders tables.

Step 3: Get Downstream Lineage

Now discover what depends on this dataset. Ask:

"What downstream systems use the customer_metrics dataset?"

{
  "urn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,prod.analytics.customer_metrics,PROD)",
  "upstream": [],
  "downstream": [
    {
      "urn": "urn:li:dashboard:(looker,customer_360)",
      "name": "Customer 360 Dashboard",
      "type": "DASHBOARD",
      "platform": "looker"
    },
    {
      "urn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,prod.ml.churn_features,PROD)",
      "name": "churn_features",
      "type": "DATASET",
      "platform": "snowflake"
    }
  ]
}

This shows that two systems depend on customer_metrics: a Looker dashboard and an ML feature table.

Step 4: Get Both Directions

To see the complete picture, request both directions. Ask:

"Show me the full lineage for customer_metrics, both upstream and downstream"

{
  "urn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,prod.analytics.customer_metrics,PROD)",
  "upstream": [
    {
      "urn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,prod.sales.customers,PROD)",
      "name": "customers",
      "type": "DATASET"
    },
    {
      "urn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,prod.sales.orders,PROD)",
      "name": "orders",
      "type": "DATASET"
    }
  ],
  "downstream": [
    {
      "urn": "urn:li:dashboard:(looker,customer_360)",
      "name": "Customer 360 Dashboard",
      "type": "DASHBOARD"
    },
    {
      "urn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,prod.ml.churn_features,PROD)",
      "name": "churn_features",
      "type": "DATASET"
    }
  ]
}

Step 5: Control Lineage Depth

By default, lineage shows direct dependencies (depth 1). For deeper traversal, specify the depth. Ask:

"Show me 3 levels of upstream lineage for customer_metrics"

With depth 3, you see the full chain:

{
  "upstream": [
    {
      "urn": "urn:li:dataset:(...,customers,PROD)",
      "name": "customers",
      "level": 1,
      "upstream": [
        {
          "urn": "urn:li:dataset:(...,raw_customers,PROD)",
          "name": "raw_customers",
          "level": 2,
          "upstream": [
            {
              "urn": "urn:li:dataset:(...,customer_events,PROD)",
              "name": "customer_events",
              "level": 3
            }
          ]
        }
      ]
    }
  ]
}

Understanding Depth

  • Depth 1: Direct dependencies only
  • Depth 2: Dependencies of dependencies
  • Depth 3+: Extended lineage chain

Higher depth means more complete lineage but slower queries. The default maximum is 5.

Step 6: Cross-Platform Lineage

DataHub tracks lineage across platforms. Ask:

"What upstream dependencies does the Customer 360 dashboard have?"

{
  "urn": "urn:li:dashboard:(looker,customer_360)",
  "upstream": [
    {
      "urn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,prod.analytics.customer_metrics,PROD)",
      "name": "customer_metrics",
      "type": "DATASET",
      "platform": "snowflake"
    },
    {
      "urn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,prod.analytics.revenue_metrics,PROD)",
      "name": "revenue_metrics",
      "type": "DATASET",
      "platform": "snowflake"
    }
  ]
}

This shows data flowing from Snowflake tables to Looker dashboards.

Lineage Use Cases

Impact Analysis

Before changing a table, check what depends on it:

"What would be impacted if I change the customers table schema?"

Root Cause Analysis

When a dashboard shows wrong data, trace back to the source:

"Where does the Customer 360 dashboard get its data from?"

Data Discovery

Find related datasets by exploring lineage:

"What other datasets are derived from the same sources as customer_metrics?"

Practice Exercises

  1. Find a dataset in your catalog and trace its full lineage
  2. Identify all dashboards that depend on a specific table
  3. Find the original source (depth 3+) for a derived dataset
  4. Discover which ML models consume data from your warehouse

What You Learned

  • Upstream lineage: tracing data sources
  • Downstream lineage: finding data consumers
  • Controlling lineage depth for deeper exploration
  • Cross-platform lineage tracking
  • Practical use cases for lineage analysis

Next Steps