Data Platform
Marquez:
A Metadata Service for Data Abstraction, Data Lineage,
and Event-based Triggers
DataEngConf NYC ‘18
Hey!
I’m Willy Lulciuc
Data Engineer
Marquez Team, Data Platform
@wslulciuc
01 Space
02 Community
03 Services
268,000
members globally
287
physical locations
72
cities
23
countries
AGENDA
01 Why metadata?
02 Room bookings pipeline (naïve)
03 Intro to Marquez
04 Room bookings pipeline (take 2)
05 Future work
01 Why metadata?
Why manage and utilize metadata?
Data lineage
● Add context to data
Democratize
● Self-service data culture
Data quality
● Build trust in data
… creating a healthy data ecosystem
A healthy data ecosystem
Freedom
● Experiment
● Flexible
● Self-sufficient
Accountability
● Cost
● Trust
Self-service
● Discover
● Explore
● Global context
Let’s get booking!
01 Location + floor
02 Open time slot
03 Duration
04 Confirm
Which location has the most bookings?
Set[RoomBooking] → LocationID
02 Room bookings pipeline (naïve)
Example: Room bookings pipeline (naïve)
Requirements
● Read room bookings
● Sum room bookings by location
● Write top location
● Run once an hour
Start → Read → Sum → Write
S3 (.csv files) → Postgres
Input (.csv files on S3):
LOCATION TS ROOM
b940314,1541624285,2
b648485,1541501885,9
b648485,1541710685,4
Output (Postgres):
LOCATIONID TS BOOKINGS
1 b648485 1541721600 2
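The read and sum steps can be sketched against the three sample rows shown above. This is only an illustration: the function names are made up for this example, the input is an in-memory string rather than files on S3, and the write to Postgres is left out.

```python
import csv
import io
from collections import Counter

# Sample input copied from the slide; columns are LOCATION, TS, ROOM.
RAW_CSV = """b940314,1541624285,2
b648485,1541501885,9
b648485,1541710685,4
"""

def read_bookings(text):
    """Read step: parse CSV rows into (location, ts, room) tuples."""
    return [(loc, int(ts), room) for loc, ts, room in csv.reader(io.StringIO(text))]

def sum_by_location(bookings):
    """Sum step: count bookings per location."""
    return Counter(loc for loc, _ts, _room in bookings)

def top_location(counts):
    """Pick the (location, bookings) pair with the most bookings."""
    return counts.most_common(1)[0]

bookings = read_bookings(RAW_CSV)
location_id, total = top_location(sum_by_location(bookings))
print(location_id, total)  # b648485 2
```

The result matches the output row on the slide: location b648485 with 2 bookings.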
We’re live!
[Diagram: Workflow in which a Scheduler runs the Room Bookings Job; labels: S3, Postgres, Upstream: Archival, Downstream: Top Locations]
Problems
● What’s our job’s input dataset?
● Does the dataset have an owner?
● How often is the dataset updated?
Curses, our job’s failing …
Oh, might be our input data!
Room field is of type string (the job expects int)
LOCATION TS ROOM
b648485,1541501885,9A
b940314,1541624285,2G
b648485,1541710685,4F
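A minimal sketch of how a type check could surface this change before the job fails. The expected column types are an assumption read off the slide (ROOM was an int), and `validate_row` is a hypothetical helper, not something Marquez provides.

```python
# Expected column types, assumed from the slide: the job reads ROOM as an int.
EXPECTED_SCHEMA = [("location", str), ("ts", int), ("room", int)]

def validate_row(line):
    """Return the (column, value) pairs that fail to parse as the expected type."""
    errors = []
    for (name, expected_type), value in zip(EXPECTED_SCHEMA, line.split(",")):
        try:
            expected_type(value)
        except ValueError:
            errors.append((name, value))
    return errors

old_row = "b648485,1541501885,9"   # room is an int
new_row = "b648485,1541501885,9A"  # room is now a string

print(validate_row(old_row))  # []
print(validate_row(new_row))  # [('room', '9A')]
```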
Problems
● What’s our job’s input dataset?
● Does the dataset have an owner?
● How often is the dataset updated?
● Coordinate changes
Ugh, gaps in output data
Backfills!
Time partitions: 00h 01h 02h 03h 04h 05h 06h 07h 08h 09h (latest)
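One way to work out what to backfill is to compare the hourly partitions that should exist with the ones actually written. The sketch below assumes partitions are labelled 00h to 09h as on the slide; the list of written partitions is invented for illustration.

```python
def missing_partitions(expected_hours, written_hours):
    """Return the hourly partitions that have no output, in order."""
    written = set(written_hours)
    return [hour for hour in expected_hours if hour not in written]

# Hourly partitions 00h..09h, as on the slide; 09h is the latest.
expected = [f"{hour:02d}h" for hour in range(10)]
# Hypothetical state: the job failed for 03h and 04h.
written = ["00h", "01h", "02h", "05h", "06h", "07h", "08h", "09h"]

to_backfill = missing_partitions(expected, written)
print(to_backfill)  # ['03h', '04h']
```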
Problems
● What’s our job’s input dataset?
● Does the dataset have an owner?
● How often is the dataset updated?
● Coordinate changes
● Figure out backfills
What we have so far …
[Diagram: Workflow in which a Scheduler runs the Room Bookings Job; labels: S3, Postgres]
… writing a job shouldn’t be this hard!
03 Intro to Marquez
Marquez: Design
Metadata Service
● Centralized metadata management
○ Jobs
○ Datasets
● Modular
○ Data discovery
○ Data health
○ Data triggers
[Diagram: Clients (JVM) and Clients (Python) talk to Marquez through a REST API; Marquez modules: Search, Health, Triggers]
Marquez: Data discovery
Module: Search
● Unified search
● Documentation
○ Owner
○ Schema
○ Datasource
[Search UI mock: query “room bo”; tabs: All, Datasets, Tags; datasource badge: S3]
Room Bookings (SF), created: jul. 8, 2018: All San Francisco room bookings
Room Booking Metrics (GLBL), created: feb. 15, 2010: Global room booking metrics
Marquez: Data health
Module: Health
● Owner
○ Team / project
● Schema
● Location
● Description
● Size
○ Growth over time
○ Number of records
● Lineage
Data graph
[Diagram legend: Dataset, Job]
Lineage queries!
[Diagram legend: Dataset, Job, Lineage]
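A lineage query over such a data graph can be sketched as a walk from one node back through everything it depends on. The node names and edges below are hypothetical, loosely based on the room bookings example, and the dictionary stands in for whatever storage the service actually uses.

```python
from collections import deque

# Directed edges: dataset -> job (the job reads it), job -> dataset (the job writes it).
# All node names are hypothetical.
EDGES = {
    "job:archival": ["dataset:room_bookings"],
    "dataset:room_bookings": ["job:room_bookings"],
    "job:room_bookings": ["dataset:top_locations"],
    "dataset:top_locations": [],
}

def upstream(node, edges):
    """Return every node that `node` transitively depends on, nearest first."""
    parents = {}
    for src, dsts in edges.items():
        for dst in dsts:
            parents.setdefault(dst, []).append(src)
    seen, queue = [], deque([node])
    while queue:
        for parent in parents.get(queue.popleft(), []):
            if parent not in seen:
                seen.append(parent)
                queue.append(parent)
    return seen

print(upstream("dataset:top_locations", EDGES))
# ['job:room_bookings', 'dataset:room_bookings', 'job:archival']
```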
Marquez: Data triggers
Module: Triggers
● Timely processing of data
○ No polling!
● Reduce manual handling of backfills
● Reduce production of bad data
○ Incomplete data
○ Low-quality data
Upstream failure detection!
[Diagram legend: Dataset, Job, Job failure]
Affected paths!
[Diagram legend: Dataset, Job, Job failure]
Cascading triggers!
[Diagram legend: Dataset, Job, Trigger]
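Cascading triggers can be pictured as an event walk: a new dataset version triggers the jobs that read it, and each job's output update triggers the next set of readers. The sketch below is an assumption about how that could work, with hypothetical job and dataset names; the same walk also lists the jobs on the paths affected by an upstream failure.

```python
from collections import deque

# Hypothetical wiring: which jobs read each dataset, and which dataset each job writes.
CONSUMERS = {
    "room_bookings": ["room_bookings_job"],
    "top_locations": ["report_job"],
}
OUTPUT_OF = {
    "room_bookings_job": "top_locations",
    "report_job": "weekly_report",
}

def cascade(updated_dataset):
    """Return the jobs triggered, in order, when `updated_dataset` gets a new version."""
    triggered, queue = [], deque([updated_dataset])
    while queue:
        dataset = queue.popleft()
        for job in CONSUMERS.get(dataset, []):
            if job not in triggered:
                triggered.append(job)
                queue.append(OUTPUT_OF[job])  # the job's output update cascades further
    return triggered

print(cascade("room_bookings"))  # ['room_bookings_job', 'report_job']
```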
Core concepts
Marquez: Core concepts
Job + Datasets
[Diagram: Input Dataset → Job → Output Dataset]
Dataset versions!
A dataset version contains a complete snapshot of data as of some point in time
[Diagram: a Job between two datasets; version labels: v1, v2 and v1, v2, v3]
Deltas “diffs”!
[Diagram: the same Job and dataset versions, with the delta Δv2→v3 marked]
INSERT INTO room_bookings (location, bookings)
VALUES ('b648485', 2);
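If each dataset version is a complete snapshot, the delta between two versions is just the rows that differ. A small sketch, assuming snapshots are lists of (location, bookings) rows; the contents of v2 are invented, and v3 adds the row from the INSERT shown above.

```python
def delta(old_snapshot, new_snapshot):
    """Return rows added and rows removed going from one dataset version to the next."""
    added = [row for row in new_snapshot if row not in old_snapshot]
    removed = [row for row in old_snapshot if row not in new_snapshot]
    return added, removed

# Each version is a complete snapshot of (location, bookings) rows.
# v2's contents are hypothetical; v3 adds the row from the INSERT on the slide.
v2 = [("b940314", 1)]
v3 = [("b940314", 1), ("b648485", 2)]

added, removed = delta(v2, v3)
print(added, removed)  # [('b648485', 2)] []
```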
Job versions!
A job version is created when business logic has changed
[Diagram: Job v1 and Job v2 alongside dataset versions v1 to v3]
Job runs!
[Diagram, built up over three slides: Job v2 starts a New Run against the Dataset; dataset version v4 appears; labels on the final slide: Finish, Update]
Data triggers!
[Diagram: the New Run of Job v2 with labels Finish and Update, dataset version v4, and a Trigger; further labels: Job v7, Job v10]
Job failures!
[Diagram: a New Run of Job v2 marked Failure; labels: Dataset, v4]
Delayed datasets!
[Diagram: a New Run of Job v2 marked Failure, with a Delay label; labels: Dataset, v4]
Design benefits
● Early upstream failure detection
● Debugging
○ What job version(s) produced / consumed dataset version X?
● Recoverability
○ Full / incremental processing
● Coordination
Data model
Marquez: Data model
[Diagram: entities Job, JobVersion, JobRun, Dataset, and DatasetVersion joined by one-to-many (1 to *) relationships; further labels: Datasource; Types: DbTable, Filesystem, Stream]
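The entities in the diagram can be sketched as plain data classes. The fields and the direction of each relationship are assumptions (the flattened diagram only shows that the links are one-to-many), and `producers` is a hypothetical helper that answers the debugging question of which job version produced a given dataset version.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DatasetVersion:
    dataset: str
    version: int

@dataclass
class JobRun:
    state: str
    inputs: List[DatasetVersion] = field(default_factory=list)
    outputs: List[DatasetVersion] = field(default_factory=list)

@dataclass
class JobVersion:
    job: str
    version: int
    runs: List[JobRun] = field(default_factory=list)  # assumed: one version, many runs

def producers(job_versions, target):
    """Return (job, version) pairs whose runs produced the given dataset version."""
    return [
        (jv.job, jv.version)
        for jv in job_versions
        if any(target in run.outputs for run in jv.runs)
    ]

# Hypothetical records based on the room bookings example.
bookings_v3 = DatasetVersion("room_bookings", 3)
top_v4 = DatasetVersion("top_locations", 4)
job_v1 = JobVersion("room_bookings_job", 1)
job_v2 = JobVersion("room_bookings_job", 2, [JobRun("COMPLETED", [bookings_v3], [top_v4])])

print(producers([job_v1, job_v2], top_v4))  # [('room_bookings_job', 2)]
```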
Metadata collection
Marquez: Metadata collection
How is metadata collected?
● Marquez API
● Language-specific SDKs
○ Java
○ Python
[Diagram: a Job records metadata in Marquez]
Workflow
Register Job
● Job version
● Inputs / outputs (logical names)
● Owner
● Description
Register Job Run
Start
● Update job run state to STARTED
Complete
● Update job run state to COMPLETED
Register Job Run Outputs
● Outputs (physical locations)
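The workflow steps can be walked through with a small in-memory stand-in. Everything here is hypothetical: the class and method names are not the Marquez API or SDK, the initial "NEW" state is an assumption (the slides name only STARTED and COMPLETED), and the output location is a placeholder.

```python
import uuid

class InMemoryMetadataStore:
    """Hypothetical stand-in for a metadata service; not the Marquez client API."""

    def __init__(self):
        self.jobs = {}
        self.runs = {}

    def register_job(self, name, version, inputs, outputs, owner, description):
        # Inputs and outputs are logical dataset names.
        self.jobs[name] = {
            "version": version, "inputs": inputs, "outputs": outputs,
            "owner": owner, "description": description,
        }

    def register_run(self, job_name):
        run_id = str(uuid.uuid4())
        self.runs[run_id] = {"job": job_name, "state": "NEW", "outputs": []}
        return run_id

    def mark_started(self, run_id):
        self.runs[run_id]["state"] = "STARTED"

    def mark_completed(self, run_id):
        self.runs[run_id]["state"] = "COMPLETED"

    def register_run_outputs(self, run_id, locations):
        # Outputs here are physical locations.
        self.runs[run_id]["outputs"].extend(locations)

store = InMemoryMetadataStore()
store.register_job(
    "room_bookings_job", version=1,
    inputs=["room_bookings"], outputs=["top_locations"],
    owner="Data Engineering", description="Top location by bookings",
)
run_id = store.register_run("room_bookings_job")
store.mark_started(run_id)
# ... the job does its work here ...
store.mark_completed(run_id)
store.register_run_outputs(run_id, ["postgres://example/top_locations"])
print(store.runs[run_id]["state"])  # COMPLETED
```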
04 Room bookings pipeline (take 2)
Example: Room bookings pipeline (take 2)
Recall, we are tasked with analyzing room booking trends …
[Diagram: Workflow in which a Scheduler runs the Room Bookings Job; labels: S3, Postgres, Top Locations]
Problems
● What’s our job’s input dataset?
● Does the dataset have an owner?
● How often is the dataset updated?
● Coordinate changes
● Figure out backfills
Enter Marquez
[Search UI mock: query “room bo”; tab: All]
Room Bookings (ALL), S3, created: feb. 15, 2010: All room bookings since beginning of time
Room Bookings (SF), S3, created: jul. 8, 2018: All San Francisco room bookings
Well, that was easy!
Room Bookings (ALL), S3
created: feb. 15, 2010
Info
Description: All room bookings since beginning of time
Owner: Data Engineering
Location: s3://room_bookings/raw/
Schema: https://registry.wework.com/schemas/ids/1
Updated: Hourly
Bonus!
Problems
● What’s our job’s input dataset?
● Does the dataset have an owner?
● How often is the dataset updated?
● Coordinate changes
● Figure out backfills
We also had to coordinate changes to our input data
Our view
[Diagram legend: Dataset, Job, Job failure; label: Room bookings workflow]
Global view!
[Diagram legend: Dataset, Job, Job failure; labels: Room bookings workflow, Top locations dataset]
Room Bookings (ALL), S3
created: feb. 15, 2010
Info
Description: All room bookings since beginning of time
Owner: Data Engineering
Location: s3://room_bookings/raw/
Schema: https://registry.wework.com/schemas/ids/2
Updated: Hourly
Oh, version bumped!
Patch, deploy, trigger!
[Diagram legend: Dataset, Job, Trigger; labels: Room bookings workflow, Top locations dataset]
Problems
● What’s our job’s input dataset?
● Does the dataset have an owner?
● How often is the dataset updated?
● Coordinate changes
● Figure out backfills
RECAP
● Make it trivial to discover datasets
● Global context when debugging
● Easily handle backfills
○ Datasets as dependencies
05 Future work
Marquez: Future work
WeWork + Marquez
● Data platform built around Marquez
● Internal integrations
○ Scheduling
○ Batching
○ Streaming
Roadmap
● Short-term
○ Release Marquez 0.1.0
○ Docs
● Long-term
○ Marquez UI
github.com/MarquezProject
@MarquezProject
Thanks!
We’re hiring!
contact: willy.lulciuc@wework.com
Questions?
