Data Lineage and Observability
Julien Le Dem
CTO and co-founder Datakin
@J_
AGENDA
01 Why metadata?
02 Intro to Marquez
03 Airflow integration
04 Marquez community
01 Why metadata?
Need to create a healthy data ecosystem
Team interdependencies
Team A Team B
Team C
DATA
● What is the data source?
● What is the schema?
● Who is the owner?
● How often is it updated?
● Where is it coming from?
● Who is using the data?
● What has changed?
Today: Limited context
Maslow’s Data hierarchy of needs
New Business Opportunities
Business optimization
Data Quality
Data Freshness
Data Availability
02 Intro to Marquez
Marquez
● Data Operations
● Data Governance
● Data Discovery
http://cidrdb.org/cidr2017/papers/p111-hellerstein-cidr17.pdf
● Data Platform built around Marquez
● Integrations
○ Ingest
○ Storage
○ Compute
[Diagram: Metadata (Marquez) spanning Ingest, Storage, and Compute (Streaming, Batch/ETL). Components shown: Kafka, Iceberg / S3, Flink, Airflow, BI]
Marquez: Data model
[Diagram: entities Source, Dataset, Dataset Version, Job, Job Version, and Run, joined by one-to-many (1 to *) relationships]
● Source types: MYSQL, POSTGRESQL, REDSHIFT, SNOWFLAKE, KAFKA, S3, ICEBERG, DELTALAKE
● Dataset types: DbTable, Filesystem, Stream
● Job types: BATCH, STREAM, SERVICE
Marquez: Data model
Design benefits
[Diagram: Dataset versions (v1, v2, v4) shown next to Job versions (v1, v2)]
● Debugging
○ What job version(s) produced and consumed dataset version X?
● Backfilling
○ Full / incremental processing
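The debugging question above can be sketched in a few lines. This is only an illustration of the idea, not Marquez code: the `Run` and `DatasetVersion` classes, the helper functions, and the job name `load_bookings` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class DatasetVersion:
    dataset: str  # dataset name, e.g. "room_bookings"
    version: str  # version label, e.g. "v4"


@dataclass
class Run:
    job: str
    job_version: str
    inputs: List[DatasetVersion] = field(default_factory=list)
    outputs: List[DatasetVersion] = field(default_factory=list)


def producers(runs, target):
    """Job versions whose runs wrote the given dataset version."""
    return sorted({(r.job, r.job_version) for r in runs if target in r.outputs})


def consumers(runs, target):
    """Job versions whose runs read the given dataset version."""
    return sorted({(r.job, r.job_version) for r in runs if target in r.inputs})


# Hypothetical run history, for illustration only.
v4 = DatasetVersion("room_bookings", "v4")
runs = [
    Run("load_bookings", "v1", outputs=[v4]),
    Run("room_bookings_7_days", "v1", inputs=[v4]),
    Run("room_bookings_7_days", "v2", inputs=[v4]),
]

print(producers(runs, v4))
print(consumers(runs, v4))
```

Because every run records the exact dataset versions it touched, both questions reduce to a filter over the run history.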
Marquez: Metadata collection
How is metadata collected?
● Push-based metadata collection
● REST API
● Language-specific SDKs
○ Java
○ Python
[Diagram: a Job pushes dataset + job metadata to Marquez]
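A push over the REST API could look roughly like the sketch below, which builds a JSON `PUT` request without sending it. The server address, the `/api/v1/sources/<name>` path, and the `build_put` helper are assumptions made for illustration; check them against the Marquez API documentation for the version in use.

```python
import json
from urllib.request import Request

# Assumption: a Marquez server at this address; adjust for a real deployment.
MARQUEZ_URL = "http://localhost:5000/api/v1"


def build_put(path, payload):
    """Build (but do not send) a JSON PUT request for the given API path."""
    return Request(
        url=f"{MARQUEZ_URL}/{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


source = {
    "type": "POSTGRESQL",
    "name": "analyticsdb",
    "connectionUrl": "jdbc:postgresql://localhost:5431/analytics",
    "description": "Contains tables such as office room bookings.",
}

request = build_put(f"sources/{source['name']}", source)
print(request.get_method(), request.full_url)
# To actually push the metadata: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the payload easy to inspect before anything reaches the server.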
Marquez: Design
Metadata Service
● Centralized metadata management
○ Sources
○ Datasets
○ Jobs
● Modular framework
○ Data governance
○ Data lineage
○ Data discovery + exploration
[Architecture diagram. Labels shown: Marquez Core; Lineage, Search, and REST APIs; Lineage collection; client-side Integrations (ETL, Batch, Stream); Listener; Core API; Metadata DB; Graph; Storage; Marquez UI; Extensions: datakin lineage analysis]
Marquez: Metadata collection
01 Source
{
  "type": "POSTGRESQL",
  "name": "analyticsdb",
  "connectionUrl": "jdbc:postgresql://localhost:5431/analytics",
  "description": "Contains tables such as office room bookings."
}
Marquez: Metadata collection
02 Dataset
{
  "type": "DB_TABLE",
  "name": "room_bookings",
  "physicalName": "public.room_bookings",
  "sourceName": "analyticsdb",
  "namespace": "datascience",
  "fields": [...],
  "description": "All global room bookings for each office."
}
Marquez: Metadata collection
03 Job
{
  "type": "BATCH",
  "name": "room_bookings_7_days",
  "inputs": [{"namespace": "datascience", "name": "room_bookings"}],
  "outputs": [],
  "location": "https://github.com/jobs/blob/124f6089...",
  "namespace": "datascience",
  "description": "Weekly email of room bookings occupancy patterns."
}
Marquez: Metadata collection
● LINK SOURCE: the dataset's "sourceName" ("analyticsdb") refers to the source's "name".
● LINK DATASET: the job's "inputs" refer to the dataset by "namespace" and "name" ("datascience", "room_bookings").
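The links between the three payloads can be checked mechanically. The following sketch uses trimmed copies of the payloads and a hypothetical helper, `unresolved_links`, which is not part of Marquez; it only illustrates how `sourceName` and `inputs` / `outputs` act as references.

```python
source = {"type": "POSTGRESQL", "name": "analyticsdb"}
dataset = {
    "type": "DB_TABLE",
    "name": "room_bookings",
    "sourceName": "analyticsdb",
    "namespace": "datascience",
}
job = {
    "type": "BATCH",
    "name": "room_bookings_7_days",
    "namespace": "datascience",
    "inputs": [{"namespace": "datascience", "name": "room_bookings"}],
    "outputs": [],
}


def unresolved_links(sources, datasets, jobs):
    """Return descriptions of references that point at nothing."""
    source_names = {s["name"] for s in sources}
    dataset_ids = {(d["namespace"], d["name"]) for d in datasets}
    problems = []
    for d in datasets:
        if d["sourceName"] not in source_names:
            problems.append(f"dataset {d['name']}: unknown source {d['sourceName']}")
    for j in jobs:
        for ref in j["inputs"] + j["outputs"]:
            if (ref["namespace"], ref["name"]) not in dataset_ids:
                problems.append(f"job {j['name']}: unknown dataset {ref['name']}")
    return problems


print(unresolved_links([source], [dataset], [job]))  # no dangling references
```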
01 Job v1
{
  "type": "BATCH",
  "name": "room_bookings_7_days",
  "inputs": [{
    "namespace": "datascience",
    "name": "room_bookings"
  }],
  "outputs": [],
  ...
}
[Lineage diagram: DATASET linked to JOB]
Marquez: Metadata collection
02 Job v2
{
  "type": "BATCH",
  "name": "room_bookings_7_days",
  "inputs": [{
    "namespace": "datascience",
    "name": "room_bookings"
  }],
  "outputs": [{
    "namespace": "datascience",
    "name": "room_bookings_aggs"
  }],
  ...
}
[Lineage diagram: Job v2 adds a lineage link to the new output dataset, alongside the lineage already recorded for Job v1]
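One way to picture why adding an output yields a new job version is to derive the version from the job definition itself. The sketch below hashes the namespace, name, inputs, and outputs; this scheme and the `job_version` helper are assumptions for illustration, and Marquez's actual versioning logic may differ.

```python
import hashlib
import json


def job_version(job):
    """Derive a deterministic version id from the fields that define a job."""
    definition = {
        "namespace": job["namespace"],
        "name": job["name"],
        "inputs": sorted((d["namespace"], d["name"]) for d in job["inputs"]),
        "outputs": sorted((d["namespace"], d["name"]) for d in job["outputs"]),
    }
    canonical = json.dumps(definition, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


job_v1 = {
    "namespace": "datascience",
    "name": "room_bookings_7_days",
    "inputs": [{"namespace": "datascience", "name": "room_bookings"}],
    "outputs": [],
}
# Same job, now also writing room_bookings_aggs.
job_v2 = dict(job_v1, outputs=[{"namespace": "datascience", "name": "room_bookings_aggs"}])

print(job_version(job_v1) != job_version(job_v2))  # the definition changed
print(job_version(job_v1) == job_version(dict(job_v1)))  # unchanged definition
```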
03 Airflow integration
Marquez: Airflow
Airflow support for Marquez
[Diagram: Airflow DAGs reporting through the Marquez Lib.]
● Metadata
○ Task lifecycle
○ Task parameters
○ Task runs linked to versioned code
○ Task inputs / outputs
● Lineage
○ Track inter-DAG dependencies
● Built-in
○ SQL parser
○ Link to code builder (GitHub)
○ Metadata extractors
Marquez: Airflow
Capturing task-level metadata in a nutshell
[Diagram: DAG, Marquez Lib. integration, Marquez REST API]
[Diagram: Airflow alongside the Marquez data model: Source, Dataset, Dataset Version, Job, Job Version, and Run, joined by one-to-many (1 to *) relationships]
Marquez: Airflow
● Open source: marquez-airflow
● Enables global task-level metadata collection
● Extends Airflow’s DAG class
room_bookings_7_days_dag.py
from marquez_airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator
...
Marquez: Airflow
Marquez Airflow Lib.
[Diagram: Airflow Operator, Marquez Airflow Lib. Extractor, Metadata]
Example
● Operator: airflow.operators.PostgresOperator
● Extractor: marquez_airflow.extractors.PostgresExtractor
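The operator-to-extractor pairing can be pictured as a lookup table keyed by operator type. The sketch below uses stand-in classes (`StubPostgresOperator`, `StubPostgresExtractor`) and a hypothetical `extract_metadata` function; it does not reproduce the marquez-airflow API and needs neither Airflow nor Marquez installed.

```python
class StubPostgresOperator:
    """Minimal stand-in for an Airflow Postgres operator (hypothetical)."""

    def __init__(self, task_id, postgres_conn_id, sql):
        self.task_id = task_id
        self.postgres_conn_id = postgres_conn_id
        self.sql = sql


class StubPostgresExtractor:
    """Stand-in extractor: turns operator attributes into metadata."""

    def extract(self, operator):
        return {
            "job": operator.task_id,
            "source": operator.postgres_conn_id,
            "sql": operator.sql.strip(),
        }


# Registry mapping an operator type to the extractor that understands it.
EXTRACTORS = {StubPostgresOperator: StubPostgresExtractor}


def extract_metadata(operator):
    """Look up the extractor for this operator; return None if unsupported."""
    extractor_cls = EXTRACTORS.get(type(operator))
    return extractor_cls().extract(operator) if extractor_cls else None


t1 = StubPostgresOperator(
    task_id="new_room_booking",
    postgres_conn_id="analyticsdb",
    sql="INSERT INTO room_bookings VALUES(%s, %s, %s)",
)
print(extract_metadata(t1))
print(extract_metadata(object()))  # unsupported operator type
```

A registry of this kind keeps DAG code unchanged: supporting a new operator means adding one extractor entry.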
Marquez: Airflow
new_room_booking_dag.py
t1 = PostgresOperator(
    task_id='new_room_booking',
    postgres_conn_id='analyticsdb',
    sql='''
    INSERT INTO room_bookings VALUES(%s, %s, %s)
    ''',
    parameters=...  # room booking
)
Operator Metadata
01 Source
02 Dataset
03 Job
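To give a feel for how a dataset can be derived from an operator's SQL, here is a deliberately naive, regex-based sketch. It is not the SQL parser shipped with marquez-airflow: `extract_tables` is a hypothetical helper, it only handles simple `INSERT INTO`, `FROM`, and `JOIN` clauses, and the second query is made up for illustration.

```python
import re

# Table name after INSERT INTO (writes) or after FROM / JOIN (reads).
_WRITE = re.compile(r"\bINSERT\s+INTO\s+([\w.]+)", re.IGNORECASE)
_READ = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def extract_tables(sql):
    """Return (inputs, outputs) table names found in a simple SQL statement."""
    outputs = sorted(set(_WRITE.findall(sql)))
    inputs = sorted(set(_READ.findall(sql)) - set(outputs))
    return inputs, outputs


print(extract_tables("INSERT INTO room_bookings VALUES(%s, %s, %s)"))
print(extract_tables(
    "INSERT INTO room_bookings_aggs "
    "SELECT room, COUNT(*) FROM room_bookings GROUP BY room"
))
```

A production parser has to cope with subqueries, CTEs, quoting, and dialect differences, which is why a real SQL parser is preferable to regular expressions.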
Marquez: Airflow
Managing inter-DAG dependencies
[Diagram: new_room_bookings_dag.py and top_room_bookings_dag.py, connected through the table public.room_bookings and the Marquez API]
public.room_bookings
LOCATION,TS,ROOM
b940314,1541624285,2
b648485,1541501885,9
b648485,1541710685,4
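Inter-DAG dependencies fall out of the collected inputs and outputs: a DAG depends on whichever DAG writes a dataset it reads. The sketch below assumes that `new_room_bookings_dag` writes `public.room_bookings` and `top_room_bookings_dag` reads it; that direction, the record layout, and the `dag_dependencies` helper are assumptions for illustration.

```python
from collections import defaultdict


def dag_dependencies(jobs):
    """Map each DAG to the upstream DAGs whose outputs it reads."""
    writers = defaultdict(set)
    for job in jobs:
        for dataset in job["outputs"]:
            writers[dataset].add(job["dag"])

    deps = defaultdict(set)
    for job in jobs:
        deps[job["dag"]]  # make sure every DAG appears in the result
        for dataset in job["inputs"]:
            deps[job["dag"]] |= writers.get(dataset, set()) - {job["dag"]}
    return {dag: sorted(upstream) for dag, upstream in deps.items()}


# Assumed read/write directions, for illustration only.
jobs = [
    {"dag": "new_room_bookings_dag", "inputs": [], "outputs": ["public.room_bookings"]},
    {"dag": "top_room_bookings_dag", "inputs": ["public.room_bookings"], "outputs": []},
]
print(dag_dependencies(jobs))
```

Neither DAG has to declare the other; the dependency is inferred from the shared dataset.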
Datakin leverages Marquez metadata
● Marquez standardizes metadata collection
○ Job runs
○ Parameters
○ Version
○ Inputs / outputs
● Datakin enables
○ Understanding operational dependencies
○ Impact analysis
○ Troubleshooting: What has changed since the last time it worked?
[Diagram: datakin: lineage analysis, graph, integrations]
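Impact analysis over lineage metadata amounts to a graph traversal: starting from a dataset, collect everything downstream of it. The sketch below is a generic breadth-first walk over an adjacency map, not Datakin's implementation; the `downstream` helper and the edge layout are hypothetical, with names reused from the earlier job examples.

```python
from collections import deque


def downstream(edges, start):
    """Breadth-first walk: every node reachable from `start` in the lineage graph."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen)


# Lineage edges: dataset -> job that reads it, job -> dataset it writes.
edges = {
    "room_bookings": ["room_bookings_7_days"],
    "room_bookings_7_days": ["room_bookings_aggs"],
}
print(downstream(edges, "room_bookings"))
```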
04 Community
https://marquezproject.github.io/marquez
Neutral
● Not controlled by a company
● Community driven
Community
● Build trust
● Grow adoption
● Everybody is on an equal footing
Governance
● Decision mechanisms
● Becoming a maintainer
● Code of Conduct
Now part of the LF AI foundation
github.com/MarquezProject
@MarquezProject
Thanks! <o/
Questions?

Data lineage and observability with Marquez - subsurface 2020