Lineage Model¶
How DataHub tracks data flow and dependencies.
What is Data Lineage?¶
Data lineage tracks the origin, movement, and transformation of data through systems. It answers:
- Where does this data come from?
- What systems use this data?
- How was this data transformed?
- What is the impact of changing this data?
Lineage Directions¶
flowchart LR
subgraph Upstream
A[Source A]
B[Source B]
end
A --> C[Your Dataset]
B --> C
subgraph Downstream
D[Dashboard]
E[ML Model]
F[Report]
end
C --> D
C --> E
C --> F
Upstream Lineage¶
Entities that feed data INTO the target entity.
- Raw data sources
- Staging tables
- Parent datasets in transformations
Downstream Lineage¶
Entities that consume data FROM the target entity.
- Dashboards and reports
- Derived datasets
- ML models
- Other consumers
Lineage Depth¶
Lineage can be traversed to different depths:
flowchart TB
subgraph Depth 3
L3A[Raw Events]
L3B[External API]
end
subgraph Depth 2
L2A[Staging Events]
L2B[Staging Users]
end
subgraph Depth 1
L1A[User Activity]
L1B[User Profiles]
end
subgraph Target
T[Analytics Dataset]
end
L3A --> L2A
L3B --> L2B
L2A --> L1A
L2B --> L1B
L1A --> T
L1B --> T
| Depth | Description |
|---|---|
| 1 | Direct dependencies only |
| 2 | Two hops away |
| 3 | Three hops away |
| N | N hops away |
Higher depth reveals more context but increases query time.
Lineage Types¶
Column-Level Lineage¶
Tracks which columns derive from which source columns.
flowchart LR
subgraph Source Table
s1[customer_id]
s2[first_name]
s3[last_name]
end
subgraph Target Table
t1[id]
t2[full_name]
end
s1 --> t1
s2 --> t2
s3 --> t2
This shows that:
idcomes fromcustomer_idfull_nameis derived fromfirst_nameandlast_name
Table-Level Lineage¶
Tracks which tables are used to create other tables.
flowchart LR
customers[customers] --> customer_orders[customer_orders]
orders[orders] --> customer_orders
customer_orders --> revenue_report[revenue_report]
Cross-Platform Lineage¶
Tracks data across different systems.
flowchart LR
subgraph Postgres
pg[raw_events]
end
subgraph Snowflake
sf_stg[stg_events]
sf_mart[mart_events]
end
subgraph Looker
dash[Events Dashboard]
end
pg --> sf_stg --> sf_mart --> dash
Lineage Sources¶
DataHub collects lineage from multiple sources:
| Source | Method |
|---|---|
| SQL parsing | Parse CREATE TABLE AS, INSERT INTO |
| dbt | dbt manifest and catalog files |
| Airflow | Task dependencies |
| Spark | Spark lineage events |
| Great Expectations | Validation dependencies |
| Manual | UI or API annotations |
Querying Lineage¶
Basic Lineage Query¶
Response Structure¶
{
"urn": "urn:li:dataset:(platform,db.schema.target,PROD)",
"upstream": [
{
"urn": "urn:li:dataset:(platform,db.schema.source1,PROD)",
"name": "source1",
"type": "DATASET",
"platform": "snowflake",
"degree": 1
},
{
"urn": "urn:li:dataset:(platform,db.schema.source2,PROD)",
"name": "source2",
"type": "DATASET",
"platform": "snowflake",
"degree": 1
}
],
"downstream": [
{
"urn": "urn:li:dashboard:(looker,dashboard_id)",
"name": "Sales Dashboard",
"type": "DASHBOARD",
"platform": "looker",
"degree": 1
}
]
}
Understanding Degree¶
The degree field indicates distance from the target:
- Degree 1: Direct dependency
- Degree 2: One hop away
- Degree 3: Two hops away
Use Cases¶
Impact Analysis¶
Before changing a table, understand what depends on it:
flowchart TB
change[Proposed Change]
target[customers table]
d1[Dashboard A]
d2[Dashboard B]
d3[ML Model]
d4[Report Generator]
change --> target
target --> d1
target --> d2
target --> d3
target --> d4
style change fill:#ff9999
style d1 fill:#ffcc99
style d2 fill:#ffcc99
style d3 fill:#ffcc99
style d4 fill:#ffcc99
Questions to answer:
- What dashboards will break?
- What ML models need retraining?
- What reports need updating?
Root Cause Analysis¶
When data quality issues occur, trace back to the source:
flowchart TB
problem[Bad Data in Report]
l1[mart_revenue]
l2[stg_orders]
l3[raw_orders]
root[Source System Bug]
problem --> l1
l1 --> l2
l2 --> l3
l3 --> root
style problem fill:#ff9999
style root fill:#ff9999
Compliance and Governance¶
Track sensitive data through the pipeline:
flowchart LR
pii[PII Source]
transform[Transform]
anonymized[Anonymized Data]
reporting[Reporting]
pii -->|Contains SSN| transform
transform -->|SSN Masked| anonymized
anonymized --> reporting
style pii fill:#ffcc99
Lineage Limitations¶
Not Always Complete¶
- Some transformations are opaque (stored procedures)
- Manual data movements are not tracked
- Some platforms do not emit lineage events
Point-in-Time¶
- Lineage represents current state
- Historical lineage may not be available
- Schema changes can break lineage
Performance Considerations¶
- Deep lineage queries can be slow
- Large lineage graphs consume memory
- Consider caching for repeated queries
Best Practices¶
Start with Depth 1¶
Begin with direct dependencies, then expand:
Filter by Direction¶
If you only need upstream or downstream:
Use for Specific Questions¶
Instead of fetching everything, query for specific needs:
- Impact analysis: downstream depth 2-3
- Root cause: upstream depth 3-5
- Direct dependencies: depth 1
Related Topics¶
- Tutorial: Exploring Lineage: Hands-on lineage exploration
- Entity Types: What entities have lineage
- Tools Reference: Lineage tool parameters