Justin Cunningham
justinc@yelp.com
Distributed Data Quality
Technical Solutions for Organizational Scaling
Yelp’s Mission
Connecting people with great
local businesses.
Lots of Services, Lots of Teams
More than 500 engineers
[Diagram: five elements, each labeled "Service"]
Alignment and Autonomy
Alignment Requires Context
Shared Data Provides Context
Giant ETL Team?
“This isn’t useful, can
you just do all the data?”
What did we do before?
Yelp's Streaming Architecture
bit.ly/kafka-talk
[Diagram. Column headings: Sources, Stream Processing, Sinks. Components shown: Schematizer, Kafka, MySQL, Other Data Stores, Yelp-main, Services, Redshift, S3/Parquet, Cassandra, Elasticsearch, Hollow, Memcached. Stream processing options: Paastorm (Python, Flatmap); Flink* (Java/Scala, Advanced Primitives & Stream SQL). Additional label: Recursive.]
Yelp's Streaming Architecture
bit.ly/kafka-talk
[Same diagram, with the column headings relabeled: Sources as Extract, Stream Processing as Transform, Sinks as Load.]
The Rest of this Talk
Documentation, Discovery and Ownership
What does this column mean?
Schema Registration and Model Extraction
Which data is available?
Which data should I use?
{data}
Where did it come from?
Data Lineage
Where did it come from?
Derived Schemas in Registration
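One way to picture lineage from derived-schema registration is as a chain of parent links that can be walked back to the original source. The sketch below is illustrative only: the `derived_from` mapping and the `lineage` function are hypothetical names, not the Schematizer's actual API, and it assumes each derived schema records at most one parent.

```python
def lineage(schema_id, derived_from):
    """Walk parent links from `schema_id` back to its root schema.

    `derived_from` is a hypothetical mapping: derived schema id -> parent schema id.
    Returns the chain starting at `schema_id` and ending at the root.
    """
    chain = [schema_id]
    seen = {schema_id}
    while chain[-1] in derived_from:
        parent = derived_from[chain[-1]]
        if parent in seen:
            # Guard against bad registrations that would loop forever.
            raise ValueError(f"cycle in schema lineage at {parent}")
        chain.append(parent)
        seen.add(parent)
    return chain


# Hypothetical registrations: schema 3 derives from 2, which derives from 1.
registrations = {3: 2, 2: 1}
print(lineage(3, registrations))
```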
Where did it come from?
Consumer/Producer Registration v1
Where did it come from?
Consumer/Producer Registration v2
Does all this stuff actually work?
Auditors Check Output Matches Invariant
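The slide does not say which invariants the auditors check. As a minimal sketch, assume the invariant is "every key emitted by a source shows up in the sink the same number of times". The function name `audit_counts` and the key-based comparison are assumptions made for illustration, not Yelp's implementation.

```python
from collections import Counter


def audit_counts(source_keys, sink_keys):
    """Check the assumed invariant: each key occurs equally often in source and sink."""
    src, dst = Counter(source_keys), Counter(sink_keys)
    missing = src - dst      # keys the sink has too few of (or none)
    unexpected = dst - src   # keys the sink has too many of
    return {
        "ok": not missing and not unexpected,
        "missing": dict(missing),
        "unexpected": dict(unexpected),
    }


# Hypothetical audit: key 2 was dropped once, and key 4 appeared from nowhere.
report = audit_counts([1, 2, 2, 3], [1, 2, 3, 4])
print(report)
```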
Meta Attributes on the Message Envelope
Is This Data Up-to-date?
                        Low Offset Delay    High Offset Delay
High Event-time Delay   Temporary?          Delay
Low Event-time Delay    No Delay            No Delay
Both event-time delay and offset delay are necessary: we need to
differentiate "no new data" from an actual delay.
High event-time delay is only an issue when there are
unprocessed offsets and that condition persists for some time.
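The rule above can be sketched as a small classifier. The thresholds, the `LagSample` shape, and the "three consecutive samples" persistence rule are all assumptions chosen for illustration; the talk does not give concrete values.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class LagSample:
    """One observation of a consumer's lag (hypothetical shape)."""
    event_time_delay: timedelta  # now minus event time of the last processed message
    offset_delay: int            # latest log offset minus last processed offset


def classify(sample, max_event_delay=timedelta(minutes=10), max_offset_delay=1000):
    """Map one sample onto the event-time / offset delay quadrants."""
    high_event = sample.event_time_delay > max_event_delay
    high_offset = sample.offset_delay > max_offset_delay
    if not high_event:
        return "no delay"
    # Old event time with little offset lag may simply mean no new data arrived.
    return "delay" if high_offset else "temporary?"


def is_stale(samples, min_consecutive=3):
    """True only when the most recent samples are all classified as 'delay'."""
    recent = samples[-min_consecutive:]
    return len(recent) == min_consecutive and all(
        classify(s) == "delay" for s in recent
    )


quiet_topic = LagSample(timedelta(hours=1), offset_delay=0)
backlog = LagSample(timedelta(hours=1), offset_delay=5000)
print(classify(quiet_topic), classify(backlog), is_stale([backlog] * 3))
```

Requiring several consecutive "delay" samples before alerting reflects the "persists for some time" condition, so a brief backlog does not page anyone.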
Is This Data Up-to-date?
Report Consistent Dimensions
From Code and Kafka
[Diagram: Yelp's streaming architecture, with column headings Sources, Stream Processing, Sinks.]
How can I get that data?
Access Polymorphism
Users Add Connections from CLI
Data Lake & the Long-term Workflow
S3
Data Lake
CLI
DAG View for Schema ID 1
DAG View is based on the connected component of
everything touching the schema.
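A connected component can be found with a breadth-first search that ignores edge direction. The sketch below assumes the registry can be flattened into (producer, consumer) edge pairs; the node names such as "schema:1" and the function names are hypothetical.

```python
from collections import defaultdict, deque


def connected_component(edges, start):
    """Return every node reachable from `start`, ignoring edge direction."""
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen


def dag_view(edges, start):
    """Keep only the directed edges whose endpoints both lie in the component."""
    component = connected_component(edges, start)
    return [(s, d) for s, d in edges if s in component and d in component]


# Hypothetical registry contents: schema 9 is unrelated to schema 1.
edges = [
    ("schema:1", "transform:a"),
    ("transform:a", "schema:2"),
    ("schema:2", "sink:redshift"),
    ("schema:9", "sink:s3"),
]
print(dag_view(edges, "schema:1"))
```

Searching in both directions is what makes the view include upstream producers as well as downstream consumers of the chosen schema.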
Where We're Headed
In Summary
● Context Enables Alignment - Reliable Shared Data Builds Context
● What does this column mean? - Documentation and Ownership
● What data is available? What data should I use? - Watson and Curated Data Sets through Tags
● Is this data accurate?
○ Where did it come from? - Data Lineage and Consumer/Producer Registration
○ Did all that stuff work? - Data Auditing
○ Is this data up-to-date? - Event-time and Offset monitoring
● How can I get that data? - Declarative Data Connections
www.yelp.com/careers/
We're Hiring!
@YelpEngineering
fb.com/YelpEngineers
engineeringblog.yelp.com
github.com/yelp
