Turning Big Data
into Knowledge
September 25, 2019
Kaan Onuk, Luyao Li, Atul Gupte
Hi, welcome!
● Engineer turned Product Manager
● Previously: building FarmVille & the mobile advertising
platform @ Zynga
● Currently: Product Manager on the Data Platform team
building data science, data knowledge, and interactive
analytics platforms
About me
Atul Gupte
Product Manager
What we’ll talk about today
Data landscape
at Uber
Our journey
since 2016
Metadata management
through Databook
Data relationships
through Lineage
We ignite opportunity by setting the world in motion
15M
Trips/Day
700+
Cities
100M
Monthly Users
Data informs every decision at the company
How Big is our Big Data?
● Daily Uber trips powered by ML: Millions
● Messages processed by Kafka: 2T
● Queries across Hive, Vertica, and Presto: 1M
● Data ingested into HDFS: 150TB
Data Platform Team
Move the world with
global data,
local insights, and
intelligent decisions.
Overview of Data at Uber
[Diagram labels: Data Infrastructure, Data Platform, Data Tools; Data Lake, Logging, Stream; Trips, Users, ...; Data Engineers, Data Modelers, Data Scientists, Data Consumers]
Challenges compounded by the scale of Data
Sources
Data produced by
Mobile users: 100s of millions
Events: Trillions/day
Data Lake
Raw: ~10,000
Curated: ~100
Derived: >100,000
Usage
WAU: 8,000+
Queries: 1M/day
Pipelines: Thousands
Metrics: Thousands
Experiments: Thousands
ML models: 10s of thousands
Self Serve & Open Platform
Use Cases
Eng ETA, surge, safety
DS incentives, churn, pickup
Ops driver onboarding, eats
cash, partner data sharing
Compliance ops metrics, city
What are users looking to do?
What data exists? How does it look?
Who’s using it? What happens if I
change it?
How can I adapt when this data
changes?
Discover Understand Trust
3+ hours
week
8%
Time wasted
every week
$$$M
Cost to company
Tasks requiring
human skill
Unproductive
time sinks
We power data fluency to help Uber
make confident, data-driven decisions
Any and all users
can access and use
datasets with ease
Users trust our data
because it meets
their expectations
Users access
appropriate data,
through compliant
means
Discover Understand Trust
[Strata NYC 2019] Turning big data into knowledge: Managing metadata and data relationships at Uber's scale
Discover
Late 2016
● Indexed small amount of data
○ Offline analytics systems
● Datasets only - no other data entities
● Catalogued basic information about datasets
Late 2016
Novice Neville
Data Scientists
Software Engineers
ML Researchers
New to Uber
Requires help finding data
Relies on George for basic tasks
Manager Michelle
General Managers
Product Managers
City Operations
CXOs & other executives
Interface w/regulators &
customers
Meet critical deadlines
Deliver reports and insights
Genius George
Data Scientists
Software Engineers
ML Researchers
Built underlying systems
Tribal knowledge champion
De-facto knowledge bank
Late 2016
2017
HQ Non-HQ
Rideshare Eats Freight ATG Elevate
Support
NLP models for
support tickets
Safety
Trip classification
Uber Eats
Restaurant
recommendations
Operations
LTV models
2017
Data Scientists
Software Engineers
ML/AI Researchers
Advanced SQL
Advanced Statistics
Scala/Spark, Python/R
Data Modeling
Inventor Ivan
Marketing Managers
Entry-level Analysts
General Managers
Product Managers
Limited SQL
Spreadsheets
Reliant Rebecca
City Operations
Regional Managers
Advanced SQL
Spreadsheets
Dashboarding
Monitoring Matt
Operations Managers
Data Analysts
Product Analysts
Advanced SQL
Spreadsheets
Limited Statistics
Limited Python/R
Analyst Anna
2017
2018
[Chart: cumulative functionality over time, low internal quality vs. high internal quality; high internal quality delivers more rapidly and cheaply later]
● Users care about a variety of data assets
○ Datasets, dashboards, metrics, etc.
● Users want a holistic view of everything that
exists about their data
○ Ownership
○ Schemas
2018
2018
2018
● Data quality and health are key concerns
● Table usage information is valuable
● Operational and regulatory environment is
growing more complex
○ GDPR
○ Access control & audits
2019
2019
● Unified interface highlighting relevant metadata
○ Ownership & usage
○ Schema and stats
○ Quality & health signals
○ Lineage
2019
● Advanced metadata management
○ Automated ingestion
○ Automated classification
○ Simplified controls for data owners
60+
Types of
metadata
Curiosity Knowledge Wisdom
● Manages Data Lineage team under Data Platform
● Earlier: Senior Software Engineer II at Uber
About me
Luyao Li
Tech Lead Manager
Data Lineage
What is data lineage?
● Where the data is from
● Where it’s been
● How it’s being transformed
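The first two questions can be read as graph traversals. The following is a minimal sketch, assuming lineage is stored as simple (source, target) edges; the table names, the `EDGES` list, and both helper functions are hypothetical and are not taken from the talk.

```python
from collections import deque

# Hypothetical lineage edges: each (source, target) pair means
# "target is produced from source". All table names are made up.
EDGES = [
    ("raw.trips", "curated.trips"),
    ("raw.users", "curated.users"),
    ("curated.trips", "derived.city_metrics"),
    ("curated.users", "derived.city_metrics"),
    ("derived.city_metrics", "dashboard.city_ops"),
]


def build_index(edges):
    """Build upstream and downstream adjacency maps from edge pairs."""
    upstream, downstream = {}, {}
    for source, target in edges:
        upstream.setdefault(target, set()).add(source)
        downstream.setdefault(source, set()).add(target)
    return upstream, downstream


def reachable(start, neighbors):
    """Breadth-first walk returning every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


UPSTREAM, DOWNSTREAM = build_index(EDGES)

# "Where is the data from?" -> everything upstream of a dataset.
origins = reachable("derived.city_metrics", UPSTREAM)
# "What happens if I change it?" -> everything downstream of a dataset.
impacted = reachable("raw.trips", DOWNSTREAM)
```

Keeping both adjacency maps means the same walk answers origin questions and impact questions; a real system would likely also attach the transforming job to each edge.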
“I’m no longer
responsible for this
table, please ask team
X”
“This is an upstream
problem, we can’t fix it”
“Please ask the
table owner”
“How do I find
the pipeline
owner?”
Multiple days
Why does it matter?
Applications
Data Freshness Data Chargebacks Anomaly Detection Compliance
Features
End-to-end Isolated ingestion
Flexible
consumption
Advanced
filtering
10,000-foot
view
1,000-foot
view
Lessons learned
High quality data is essential for success
Always be customer obsessed
Magical search has a huge impact on usability
Make big, bold bets
What’s next
● Column-level lineage
● Self-diagnostic and reporting
● Self-serve onboarding
● Recommendation
/ Manages the Metadata Platform team within Big Data
/ Previously - Senior Software Engineer / Tech Lead for
Data Discovery & Data Privacy @ Uber
About me
Kaan Onuk
Engineering Manager
/ What is metadata?
/ Why does metadata matter?
What is
metadata?
Uber’s massive data holds deep
hidden insights.
Metadata helps to surface them.
Metadata drives data
productivity by making data easy
to discover, understand, and
govern.
/ Metadata Sources
/ Metadata Registry
/ Metadata Collection
/ Metadata Storage
/ Data Model
Metadata
Sources
uMetadata
Metadata Registry/Definition
Metadata Collection
Pull model Push model
○ Crawler (periodic)
e.g. sample data, stats
○ Event-based (Event Listeners)
e.g. data quality
○ Automated
e.g. data retention policies
○ Crowdsource
e.g. table descriptions
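As a rough illustration of the pull and push models, the sketch below contrasts a crawler that visits datasets on a schedule with an event listener that records metadata when a producer emits it. Everything here (the in-memory `metadata_store`, the `quality_check_finished` event name, the sample table) is an assumption made for the example, not a description of Uber's implementation.

```python
import time

# In-memory stand-in for a metadata store, keyed by (dataset, metadata type).
metadata_store = {}


def record(dataset, kind, value, source):
    """Store one piece of metadata along with how it was collected."""
    metadata_store[(dataset, kind)] = {
        "value": value,
        "source": source,
        "collected_at": time.time(),
    }


# Pull model: a crawler visits each dataset on a schedule and computes stats.
def crawl(datasets):
    for name, rows in datasets.items():
        record(name, "row_count", len(rows), source="crawler")
        record(name, "sample", rows[:2], source="crawler")


# Push model: producers emit events and a listener records them on arrival.
listeners = {}


def on(event_type):
    """Register a handler for one event type."""
    def register(handler):
        listeners.setdefault(event_type, []).append(handler)
        return handler
    return register


def emit(event_type, payload):
    """Deliver a payload to every handler registered for the event type."""
    for handler in listeners.get(event_type, []):
        handler(payload)


@on("quality_check_finished")
def store_quality(payload):
    record(payload["dataset"], "quality", payload["status"], source="event")


# Hypothetical data: one tiny table and one quality event.
crawl({"trips": [{"id": 1}, {"id": 2}, {"id": 3}]})
emit("quality_check_finished", {"dataset": "trips", "status": "passed"})
```

The trade-off the sketch hints at: pulled metadata is only as fresh as the last crawl, while pushed metadata arrives immediately but depends on producers emitting events.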
Storage
● Hive for analytical queries
and audit purposes
● Kafka to capture
metadata changes
● MySQL for persistent
storage
● Redis for cache to support
low latency & high
throughput
● Search functionality
powers various internal
platforms, including
Databook for data
discovery
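The slide names the stores but not how they interact. One common arrangement, shown here purely as an assumption, is a cache-aside read path with a change log on writes. Plain Python objects stand in for MySQL, Redis, and Kafka; the `MetadataStore` class and its keys are hypothetical.

```python
class MetadataStore:
    """Toy store: dicts and a list stand in for MySQL, Redis, and Kafka."""

    def __init__(self):
        self.primary = {}      # stand-in for persistent storage (MySQL)
        self.cache = {}        # stand-in for the low-latency cache (Redis)
        self.change_log = []   # stand-in for a change stream (Kafka)
        self.primary_reads = 0

    def write(self, key, value):
        """Persist the value, log the change, and drop any stale cache entry."""
        old = self.primary.get(key)
        self.primary[key] = value
        self.change_log.append({"key": key, "old": old, "new": value})
        self.cache.pop(key, None)

    def read(self, key):
        """Cache-aside read: serve from cache, fall back to primary on a miss."""
        if key in self.cache:
            return self.cache[key]
        self.primary_reads += 1
        value = self.primary.get(key)
        if value is not None:
            self.cache[key] = value
        return value


store = MetadataStore()
store.write("trips.owner", "team-a")
first = store.read("trips.owner")     # miss: goes to primary, fills cache
second = store.read("trips.owner")    # hit: served from cache
store.write("trips.owner", "team-b")  # invalidates the cached entry
third = store.read("trips.owner")     # miss again, sees the new value
```

In this arrangement the change log is what downstream consumers, such as an audit table or a search index, would read to stay in sync.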
Metadata Store: Data Model Requirements
1. Discovery
2. Cluster-specific & agnostic metadata
3. Easy metadata type creation
4. Flexibility on onboarding new entities
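To make two of these requirements concrete, here is a minimal sketch of a data model in which new metadata types are added through a registry and values can be either cluster-agnostic or cluster-specific. The `REGISTRY`, the `Entity` class, and the cluster names are all hypothetical; the talk does not describe the actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical registry: a metadata type is a name plus an expected Python type.
REGISTRY = {}


def register_type(name, value_type):
    """Adding a new metadata type is a single registry entry."""
    REGISTRY[name] = value_type


@dataclass
class Entity:
    name: str
    agnostic: dict = field(default_factory=dict)     # same on every cluster
    per_cluster: dict = field(default_factory=dict)  # cluster -> {type: value}

    def put(self, meta_type, value, cluster=None):
        """Validate against the registry, then store at the right scope."""
        expected = REGISTRY[meta_type]  # KeyError for unregistered types
        if not isinstance(value, expected):
            raise TypeError(f"{meta_type} expects {expected.__name__}")
        if cluster is None:
            self.agnostic[meta_type] = value
        else:
            self.per_cluster.setdefault(cluster, {})[meta_type] = value

    def get(self, meta_type, cluster=None):
        """Prefer the cluster-specific value, fall back to the agnostic one."""
        specific = self.per_cluster.get(cluster, {})
        return specific.get(meta_type, self.agnostic.get(meta_type))


register_type("owner", str)
register_type("row_count", int)

table = Entity("trips")
table.put("owner", "team-a")                       # cluster-agnostic
table.put("row_count", 1200, cluster="cluster-1")  # cluster-specific
table.put("row_count", 900, cluster="cluster-2")
```

A lookup such as `table.get("owner", cluster="cluster-1")` falls back to the agnostic value, so facts like ownership are stored once rather than once per cluster.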
Metadata
Store
Data Model
Metadata
Management
(2019)
1. Easy onboarding
2. Derived metadata
through relationships
3. Efficient metadata
retrieval
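One way to read "derived metadata through relationships" is that a dataset can inherit metadata from the datasets it is built from. The sketch below illustrates that idea with tag propagation; the `PARENTS` map, the tag names, and `effective_tags` are hypothetical and only meant to show the mechanism.

```python
# Hypothetical relationships: each dataset maps to the datasets it is built from.
PARENTS = {
    "curated.users": ["raw.users"],
    "curated.trips": ["raw.trips"],
    "derived.user_trips": ["curated.users", "curated.trips"],
}

# Tags set directly on a dataset (for example by an owner or a classifier).
DIRECT_TAGS = {
    "raw.users": {"pii"},
    "raw.trips": {"geo"},
}


def effective_tags(dataset, parents=PARENTS, direct=DIRECT_TAGS, _seen=None):
    """Return direct tags plus tags inherited from all ancestors."""
    seen = _seen if _seen is not None else set()
    if dataset in seen:  # guard against cycles in the relationship data
        return set()
    seen.add(dataset)
    tags = set(direct.get(dataset, ()))
    for parent in parents.get(dataset, ()):
        tags |= effective_tags(parent, parents, direct, seen)
    return tags


derived = effective_tags("derived.user_trips")
```

Because the tags are computed from relationships rather than copied, a tag added to a raw table is reflected downstream without touching each derived dataset.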
Key Takeaways
1. Centralized Metastore: Datasets + Artifacts
2. Metadata Registry: Taxonomy / Metadata scheme
3. Metadata Collection: Choose the right approach
4. Data Model: Leverage metadata relationships
Next Steps
Metadata Management
Innovate
- Personalization to improve discovery
- Graph traversal optimizations
Automate
- Human-in-the-loop AI
Establish trust & accountability
- More integrations with data infra
- Very high qps & low latency
Accessible but secure foundation
- Fully self-served, ontology-based metadata management
Proprietary and confidential © 2018 Uber Technologies, Inc. All rights reserved. No part of this
document may be reproduced or utilized in any form or by any means, electronic or mechanical,
including photocopying, recording, or by any information storage or retrieval systems, without
permission in writing from Uber. This document is intended only for the use of the individual or entity
to whom it is addressed and contains information that is privileged, confidential or otherwise exempt
from disclosure under applicable law. All recipients of this document are notified that the information
contained herein includes proprietary and confidential information of Uber, and recipient may not
make use of, disseminate, or in any way disclose this document or any of the enclosed information
to any person other than employees of addressee to the extent necessary for consultations with
authorized personnel of Uber.
Thank you!
kaan@uber.com
