Data lineage and observability with OpenLineage
Julien Le Dem, CTO and Co-Founder Datakin | May 2021
AGENDA
● The need for metadata
● OpenLineage - the open standard for lineage collection - and Marquez, its reference implementation
● Spark observability with OpenLineage
The need for Metadata
Building a healthy data ecosystem
Team A Team B
Team C
Today: Limited context
● What is the data source?
● What is the schema?
● Who is the owner?
● How often is it updated?
● Where is it coming from?
● Who is using the data?
● What has changed?
DATA
Maslow’s Data hierarchy of needs
New Business Opportunities
Business Optimization
Data Quality
Data Freshness
Data Availability
OpenLineage
OpenLineage contributors
Creators and contributors from major open source projects involved
Purpose:
Define an open standard for metadata and lineage collection by instrumenting data pipelines as they are running.
Purpose:
EXIF for data pipelines
Problem
Before:
● Duplication of effort: each project has to instrument all jobs
● Integrations are external and can break with new versions
With OpenLineage:
● Effort of integration is shared
● Integration can be pushed in each project: no need to play catch up
[Diagram: OpenLineage scope vs. not in scope. Labels: Backend; Integrations; Metadata and lineage collection standard; Warehouse, Schedulers, ...; Kafka topic; Graph db; HTTP client; Consumers; Kafka client; GraphDB client; ...]
Core Model:
- JSONSchema spec
- Consistent naming:
Jobs:
scheduler.job.task
Datasets:
instance.schema.table
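The naming convention above can be sketched as two small helpers. This is only an illustration: the function names and the sample values are hypothetical, and only the dotted patterns (scheduler.job.task and instance.schema.table) come from the slide.

```python
def job_name(scheduler: str, job: str, task: str) -> str:
    """Build a job name following the scheduler.job.task pattern."""
    return ".".join([scheduler, job, task])


def dataset_name(instance: str, schema: str, table: str) -> str:
    """Build a dataset name following the instance.schema.table pattern."""
    return ".".join([instance, schema, table])


if __name__ == "__main__":
    # Sample values are made up for illustration.
    print(job_name("airflow", "daily_orders", "load"))
    print(dataset_name("warehouse", "public", "orders"))
```

Because every integration derives names the same way, two tools that touch the same table end up referring to the same dataset.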
Protocol:
- Asynchronous events:
Unique run id to identify a run and correlate its events
- Configurable backend:
- Kafka
- HTTP
Examples:
● Run Start event
○ source code version
○ run parameters
● Run Complete event
○ input dataset
○ output dataset version and schema
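A minimal sketch of how a start event and a complete event might share one run id so that a backend can correlate them. The field names (eventType, eventTime, run.runId, job, inputs, outputs) follow my reading of the OpenLineage JSONSchema and should be checked against the spec; the "example" namespace, the helper name and the dataset names are assumptions, and required fields such as the producer are left out for brevity.

```python
import json
import uuid
from datetime import datetime, timezone


def make_event(event_type, run_id, job, inputs=(), outputs=()):
    """Return a dict shaped like a trimmed-down run event (illustrative only)."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},  # same id on every event of the run
        "job": {"namespace": "example", "name": job},
        "inputs": [{"namespace": "example", "name": n} for n in inputs],
        "outputs": [{"namespace": "example", "name": n} for n in outputs],
    }


if __name__ == "__main__":
    run_id = str(uuid.uuid4())
    start = make_event("START", run_id, "scheduler.job.task")
    complete = make_event(
        "COMPLETE",
        run_id,
        "scheduler.job.task",
        inputs=["instance.schema.source_table"],
        outputs=["instance.schema.target_table"],
    )
    # Events are sent independently; the shared runId ties them together.
    print(json.dumps(start, indent=2))
    print(json.dumps(complete, indent=2))
```

The source code version, run parameters and output schema mentioned on the slide would travel as facets on these events; they are omitted here to keep the sketch short.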
Facets
● Extensible:
Facets are atomic pieces of metadata
identified by a unique name that can be
attached to the core entities.
● Decentralized:
Prefixes in facet names allow the
definition of Custom facets that can be
promoted to the spec at a later point.
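The two properties above can be illustrated with a plain dictionary keyed by facet name. This is a sketch under assumptions, not the spec's API: the helper names, the "acme" prefix and the facet payloads are all hypothetical.

```python
def attach_facet(entity: dict, name: str, payload: dict) -> dict:
    """Attach one facet to an entity under its unique name."""
    entity.setdefault("facets", {})[name] = payload
    return entity


def is_custom_facet(name: str, prefix: str) -> bool:
    """A custom facet is recognised here by its '<prefix>_' name prefix."""
    return name.startswith(prefix + "_")


if __name__ == "__main__":
    dataset = {"name": "instance.schema.table"}
    # A facet with an unprefixed name, as it might appear in the spec.
    attach_facet(dataset, "schema", {"fields": [{"name": "id", "type": "BIGINT"}]})
    # A custom facet defined by one organisation (hypothetical prefix "acme").
    attach_facet(dataset, "acme_dataQuality", {"nullRatio": 0.01})
    for facet_name in dataset["facets"]:
        print(facet_name, "custom" if is_custom_facet(facet_name, "acme") else "core")
```

Since each facet lives under its own key, a consumer can ignore the ones it does not understand, and a prefixed facet can later be renamed into the spec without touching the core entities.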
Facet examples
Dataset:
- Stats
- Schema
- Version
- Column level lineage
Job:
- Source code
- Dependencies
- params
- Source control
- Query plan
- Query profile
Run:
- Schedule time
- Batch id
Metadata:
● Data Platform built around Marquez
● Integrations
○ Ingest
○ Storage
○ Compute
[Diagram labels: Ingest, Storage, Compute; Streaming, Batch/ML; Flink, Airflow, Kafka, Iceberg / S3, BI; OpenLineage]
Marquez: Data model
[Diagram: entities Source, Dataset, Dataset Version, Job, Job Version and Run, connected by one-to-many (1 to *) relationships]
Types listed on the slide:
● MYSQL
● POSTGRESQL
● REDSHIFT
● SNOWFLAKE
● KAFKA
● S3
● ICEBERG
● DELTALAKE
● BATCH
● STREAM
● SERVICE
API
Datakin leverages Marquez metadata
● OpenLineage and Marquez standardize metadata collection
○ Job runs
○ Parameters
○ Version
○ Inputs / outputs
● Datakin enables
○ Understanding operational dependencies
○ Impact analysis
○ Troubleshooting: What has changed since the last time it worked?
[Diagram labels: Lineage analysis, Graph, Integrations]
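The troubleshooting question "what has changed since the last time it worked?" comes down to comparing the metadata of the last successful run with that of the failing one. The sketch below assumes run metadata has already been flattened into a dictionary; the keys (job_version, parameters, input_versions) and the function name are hypothetical, and this is not how Datakin or Marquez implement it.

```python
def diff_runs(last_good: dict, failed: dict) -> dict:
    """Return {key: (old, new)} for every top-level key whose value differs."""
    changes = {}
    for key in sorted(set(last_good) | set(failed)):
        old, new = last_good.get(key), failed.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes


if __name__ == "__main__":
    last_good = {
        "job_version": "a1b2c3",
        "parameters": {"date": "2021-05-01"},
        "input_versions": {"instance.schema.orders": "v41"},
    }
    failed = {
        "job_version": "d4e5f6",  # the code changed between the two runs
        "parameters": {"date": "2021-05-01"},
        "input_versions": {"instance.schema.orders": "v41"},
    }
    for key, (old, new) in diff_runs(last_good, failed).items():
        print(f"{key}: {old} -> {new}")
```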
Spark observability with OpenLineage
Spark java agent
spark.driver.extraJavaOptions:
-javaagent:marquez-spark-0.13.1.jar={argument}
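One way to pass the setting above is through spark-submit's --conf option. The helper below only assembles the command line and does not run it; the function name and the application file are hypothetical, and the agent argument is kept as the slide's {argument} placeholder because its format is not given here.

```python
def spark_submit_args(agent_jar: str, agent_argument: str, app: str) -> list:
    """Assemble a spark-submit command line that loads the Java agent in the driver."""
    java_opts = f"-javaagent:{agent_jar}={agent_argument}"
    return [
        "spark-submit",
        "--conf",
        f"spark.driver.extraJavaOptions={java_opts}",
        app,
    ]


if __name__ == "__main__":
    # "{argument}" is the unfilled placeholder from the slide, not a real value.
    args = spark_submit_args("marquez-spark-0.13.1.jar", "{argument}", "my_job.py")
    print(" ".join(args))
```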
Metadata collected:
Lineage: inputs/outputs
Data volume: row count/byte size
Logical plan
Lineage model
Lineage Example across jobs
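Once each job reports its inputs and outputs, lineage across jobs is a graph walk. A minimal sketch, assuming the collected lineage is available as a mapping from job name to its input and output dataset names; the function name and the sample jobs are hypothetical.

```python
from collections import deque


def downstream_datasets(jobs: dict, start: str) -> set:
    """Return every dataset reachable downstream of `start`.

    `jobs` maps a job name to a pair (input datasets, output datasets).
    """
    seen = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for inputs, outputs in jobs.values():
            if current not in inputs:
                continue
            for out in outputs:
                if out != start and out not in seen:
                    seen.add(out)
                    queue.append(out)
    return seen


if __name__ == "__main__":
    jobs = {
        "clean_orders": (["raw.orders"], ["clean.orders"]),
        "daily_report": (["clean.orders"], ["report.daily"]),
        "unrelated": (["raw.users"], ["clean.users"]),
    }
    print(sorted(downstream_datasets(jobs, "raw.orders")))
```

The same walk answers the impact-analysis question: everything in the returned set may be affected by a change to the starting dataset.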
Example of OpenLineage metadata usage: Data volume evolution
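Tracking data volume over time can be as simple as comparing the row count reported by each run with the previous one. The sketch below assumes the per-run row counts have already been extracted from the collected metadata; the function name, the 50% threshold and the sample numbers are hypothetical.

```python
def volume_drops(row_counts: list, threshold: float = 0.5) -> list:
    """Return the indexes of runs whose row count fell below
    `threshold` times the row count of the previous run."""
    return [
        i
        for i in range(1, len(row_counts))
        if row_counts[i] < threshold * row_counts[i - 1]
    ]


if __name__ == "__main__":
    # Row counts of one output dataset across four consecutive runs (made up).
    counts = [1000, 1050, 400, 420]
    print(volume_drops(counts))
```

A sudden drop flagged this way points at the run to inspect, and the lineage collected for that run shows which inputs fed it.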
Join the conversation
OpenLineage:
Github: github.com/OpenLineage
Slack: OpenLineage.slack.com
Twitter: @OpenLineage
Email: groups.google.com/g/openlineage
Marquez:
Github: github.com/MarquezProject/marquez
Slack: MarquezProject.slack.com
Twitter: @MarquezProject

Observability for Data Pipelines With OpenLineage