Distributed Data Quality - Technical Solutions for Organizational Scaling

Justin Cunningham
justinc@yelp.com

Yelp’s Mission
Connecting people with great
local businesses.

Lots of Services, Lots of Teams
More than 500 engineers
[Diagram: many separate boxes, each labeled "Service"]

Giant ETL Team?
“This isn’t useful, can
you just do all the data?”

Yelp's Streaming Architecture
bit.ly/kafka-talk
[Diagram columns: Sources, Stream Processing, Sinks; central components: Schematizer, Kafka]
Data stores and systems shown: MySQL, Other Data Stores, Yelp-main, Services, Redshift, S3/Parquet, Cassandra, Elasticsearch, Hollow, Memcached
Stream Processing:
• Paastorm: Python, Flatmap
• Flink*: Java/Scala, Advanced Primitives & Stream SQL
Recursive

Yelp's Streaming Architecture
bit.ly/kafka-talk
[Diagram: same architecture as the previous slide, with the columns relabeled Extract, Transform, Load]

Documentation, Discovery and Ownership
What does this column mean?

Schema Registration and Model Extraction

Which data is available?
Which data should I use?
{data}

Where did it come from?
Data Lineage

Derived Schemas in Registration
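
One way to picture derived schemas: each registered schema records the schema IDs it was derived from, and lineage is recovered by walking those links upstream. The sketch below assumes a hypothetical in-memory registry; `Schema`, `register`, `lineage`, and the example schema names are illustrative and are not the Schematizer's API.

```python
from dataclasses import dataclass, field


@dataclass
class Schema:
    schema_id: int
    name: str
    derived_from: list = field(default_factory=list)  # upstream schema IDs


# Hypothetical in-memory stand-in for a schema registry.
REGISTRY = {}


def register(schema_id, name, derived_from=()):
    """Register a schema, recording which schemas it was derived from."""
    schema = Schema(schema_id, name, list(derived_from))
    REGISTRY[schema_id] = schema
    return schema


def lineage(schema_id):
    """Return every upstream schema ID, nearest first (breadth-first walk)."""
    seen, order = set(), []
    queue = list(REGISTRY[schema_id].derived_from)
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        queue.extend(REGISTRY[current].derived_from)
    return order


register(1, "business")
register(2, "review")
register(3, "business_review_counts", derived_from=[1, 2])
register(4, "top_businesses", derived_from=[3])
print(lineage(4))
```

Storing only the direct parents at registration time keeps each registration cheap; the full lineage is computed on demand.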

Consumer/Producer Registration v1

Consumer/Producer Registration v2

Does all this stuff actually work?

Auditors Check Output Matches Invariant
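
A minimal sketch of the auditor idea, assuming a hypothetical invariant: a pass-through transformation must produce exactly as many messages as it consumed. The `Auditor` class and its method names are illustrative only; the slides do not show the actual auditor interface.

```python
class Auditor:
    """Counts messages on both sides of a transformation and checks an invariant."""

    def __init__(self, invariant):
        # invariant: callable(consumed, produced) -> bool
        self.invariant = invariant
        self.consumed = 0
        self.produced = 0

    def record_consumed(self, n=1):
        self.consumed += n

    def record_produced(self, n=1):
        self.produced += n

    def check(self):
        """Return True when the observed counts satisfy the invariant."""
        return self.invariant(self.consumed, self.produced)


# Assumed invariant for a one-in, one-out transformation.
auditor = Auditor(lambda consumed, produced: consumed == produced)

for message in [{"id": 1}, {"id": 2}, {"id": 3}]:
    auditor.record_consumed()
    # ... the transformation itself would run here ...
    auditor.record_produced()

audit_ok = auditor.check()
print("invariant holds:", audit_ok)
```

Other invariants (for example, "produced is at least consumed" for a flatmap) can be swapped in by passing a different callable.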

Meta Attributes on the Message Envelope

Is This Data Up-to-date?
                         Low Offset Delay    High Offset Delay
High Event-time Delay    Temporary?          Delay
Low Event-time Delay     No Delay            No Delay
Both event-time delay and offset delay are necessary: together they
differentiate "no new data" from an actual delay.
High event-time delay is only an issue when there are
unprocessed offsets and that condition persists for some time.
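
The decision table above can be sketched as a small classifier plus a persistence check. The thresholds, the function names, and the idea of alerting after a fixed number of consecutive samples are assumptions made for illustration; the slides do not specify them.

```python
def classify_freshness(event_time_delay_s, offset_delay,
                       max_event_time_delay_s=300, max_offset_delay=0):
    """Map the two delay measurements onto the cells of the table.

    event_time_delay_s: seconds between now and the newest processed event time.
    offset_delay: number of offsets not yet processed.
    Both thresholds are assumed values, not taken from the talk.
    """
    high_event_time = event_time_delay_s > max_event_time_delay_s
    high_offset = offset_delay > max_offset_delay
    if not high_event_time:
        return "no delay"
    if high_offset:
        return "delay"
    # Old events but nothing left to process: possibly just no new data.
    return "temporary?"


def should_alert(samples, min_consecutive=3):
    """Alert only when the most recent samples are all classified as 'delay'."""
    recent = samples[-min_consecutive:]
    return (len(recent) == min_consecutive
            and all(classify_freshness(*s) == "delay" for s in recent))


# Each sample is (event_time_delay_s, offset_delay).
history = [(900, 0), (900, 120), (1200, 450), (1500, 800)]
print([classify_freshness(*s) for s in history])
print(should_alert(history))
```

Requiring several consecutive "delay" samples reflects the point that the condition has to persist for some time before it counts as a problem.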

Report Consistent Dimensions
From Code and Kafka

[Diagram: Yelp's Streaming Architecture, repeated, with columns Sources, Stream Processing, Sinks]
How can I get that data?
Access Polymorphism

Users Add Connections from CLI

Data Lake & the Long-term Workflow
S3
Data Lake
CLI

DAG View for Schema ID 1
DAG View is based on the connected component of
everything touching the schema.
Where We're Headed
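
A connected component like the one behind the DAG view can be found with a breadth-first search that ignores edge direction. The sketch below assumes the registrations are available as a list of directed edges between producers, schemas, and consumers; the node names and the `connected_component` helper are hypothetical.

```python
from collections import defaultdict, deque


def connected_component(edges, start):
    """Return every node reachable from start when edge direction is ignored."""
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for other in neighbors[node]:
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen


# Hypothetical registrations: producer -> schema, schema -> consumer,
# and schema -> derived schema.
EDGES = [
    ("producer:yelp-main", "schema:1"),
    ("schema:1", "consumer:redshift-loader"),
    ("schema:1", "schema:3"),
    ("producer:other-service", "schema:2"),
]

component = connected_component(EDGES, "schema:1")
# Keep the original directed edges inside the component to draw the DAG.
dag_edges = [(s, d) for s, d in EDGES if s in component and d in component]
print(sorted(component))
```

The search treats edges as undirected so that both upstream producers and downstream consumers are included; the directed edges are kept only for rendering.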

In Summary
● Context Enables Alignment - Reliable Shared Data Builds Context
● What does this column mean? - Documentation and Ownership
● What data is available? What data should I use? - Watson and Curated Data
Sets through Tags
● Is this data accurate?
○ Where did it come from? - Data Lineage and Consumer/Producer
Registration
○ Did all that stuff work? - Data Auditing
○ Is this data up-to-date? - Event-time and Offset monitoring
● How can I get that data? - Declarative Data Connections

www.yelp.com/careers/
We're Hiring!

@YelpEngineering
fb.com/YelpEngineers
engineeringblog.yelp.com
github.com/yelp
