Confidential
Mao Ye
Big Data Platform at Pinterest
Data Architecture
Design Choices for Hadoop Platform
Pinball for Workflow Management
Data Architecture
Data at Pinterest
• 60 Billion Pins
• 1 Billion boards
• 100M MAU
• 60 PB of data on S3
• 3 PB processed every day
• 2000 node Hadoop cluster
• 250 engineers
Pinterest Data Architecture
[Diagram components: App, events, Singer, Kafka, Secor, Qubole (Hadoop), Pinball, Skyline, Redshift, Pinalytics, Features]
Design Choices for Hadoop Platform
• Ephemeral clusters
• Access control layer
• Shared data store
• Easy deployment
Hadoop Platform Requirements
• Isolated multi-tenancy
• Elasticity
• Support multiple clusters
Decoupling compute & storage
[Diagram: Hadoop Cluster 1 (transient HDFS), Hadoop Cluster 2 (transient HDFS), S3 persistent store]
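A minimal sketch of the idea on this slide, using hypothetical in-memory stand-ins that do not come from the talk: `SharedStore` plays the role of S3 and `TransientCluster` plays a Hadoop cluster with its own short-lived HDFS. It only illustrates that results written to the shared store outlive any single cluster.

```python
class SharedStore:
    """Hypothetical stand-in for the persistent S3 store."""

    def __init__(self):
        self.objects = {}

    def put(self, key, value):
        self.objects[key] = value

    def get(self, key):
        return self.objects[key]


class TransientCluster:
    """Hypothetical stand-in for a cluster whose HDFS is scratch space only."""

    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.hdfs = {}  # lost when the cluster is terminated

    def run_job(self, input_key, output_key, fn):
        data = self.store.get(input_key)        # read input from the shared store
        self.hdfs["tmp/" + output_key] = data   # intermediate data stays local
        self.store.put(output_key, fn(data))    # final output goes to the shared store

    def terminate(self):
        self.hdfs.clear()


store = SharedStore()
store.put("raw/events", [3, 1, 2])

cluster_1 = TransientCluster("cluster-1", store)
cluster_1.run_job("raw/events", "sorted/events", sorted)
cluster_1.terminate()

# A second cluster picks up where the first one left off.
cluster_2 = TransientCluster("cluster-2", store)
cluster_2.run_job("sorted/events", "max/events", max)
```

Because no durable state lives on a cluster, a cluster can be started, resized, or shut down without a data migration step.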
Centralized Hive Metastore
[Diagram: Pig, Cascading, Hive; Hive Metastore (metadata); HDFS/S3 (data)]
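As a rough illustration of why a single metastore helps, the sketch below uses a hypothetical `Metastore` class (not the Hive Metastore API) and a made-up bucket path. Each engine asks the same catalog where a table lives, so all of them resolve the same data location.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableInfo:
    location: str
    columns: tuple


class Metastore:
    """Hypothetical catalog mapping table names to metadata."""

    def __init__(self):
        self._tables = {}

    def register(self, name, location, columns):
        self._tables[name] = TableInfo(location, tuple(columns))

    def lookup(self, name):
        return self._tables[name]


metastore = Metastore()
# The bucket and column names are placeholders for illustration.
metastore.register("pins", "s3://example-bucket/warehouse/pins", ["pin_id", "board_id"])


def resolve(engine, table):
    """Every engine resolves the table through the same catalog."""
    info = metastore.lookup(table)
    return f"{engine} reads {info.location}"


plans = [resolve(engine, "pins") for engine in ("Hive", "Pig", "Cascading")]
```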
Multi-layered Packaging
• Runtime Staging (on S3): Mapreduce Jobs, Hadoop Jars/Libs, Job/User level Configs
• Automated Configuration (Masterless Puppet): Software Packages/Libs, Configs (OS/Hadoop), Misc Sys Admin
• Baked AMI: OS, Bootstrap Script, Core SW
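One way to picture the three packaging layers named on this slide (baked AMI, automated configuration, runtime staging) is as stacked settings where a later layer overrides an earlier one. The sketch below makes that assumption explicit; the keys and values are invented for illustration and the override order is not stated in the talk.

```python
from collections import ChainMap

# Layer 1: baked into the machine image (changes rarely). Values are made up.
baked_ami = {
    "os.family": "linux",
    "mapreduce.map.memory.mb": "1024",
}

# Layer 2: applied by automated configuration at boot. Values are made up.
automated_config = {
    "mapreduce.map.memory.mb": "2048",
    "monitoring.agent": "enabled",
}

# Layer 3: pulled from runtime staging when a job starts. Values are made up.
runtime_staging = {
    "job.jar": "s3://example-bucket/staging/job.jar",
    "mapreduce.map.memory.mb": "4096",
}

# ChainMap searches its maps left to right, so the first map wins.
# Assumed precedence: runtime staging > automated config > baked AMI.
effective = ChainMap(runtime_staging, automated_config, baked_ami)

memory = effective["mapreduce.map.memory.mb"]   # taken from runtime staging
os_family = effective["os.family"]              # falls through to the baked AMI
```

Keeping slow-changing pieces in the image and fast-changing pieces in staging means a job-level change does not require rebuilding the image.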
Executor Abstraction Layer
[Diagram: Pinball, Dev Server; Executor; Qubole Managed Hadoop, EMR; Hive Metastore; HDFS/S3]
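The executor layer can be sketched as a small interface with one implementation per Hadoop provider. Everything below is hypothetical: the class names, the `run_hive_query` method, and the returned strings are placeholders, and no real Qubole or EMR API is called.

```python
from abc import ABC, abstractmethod


class Executor(ABC):
    """Hypothetical interface that callers depend on instead of a vendor API."""

    @abstractmethod
    def run_hive_query(self, query: str) -> str:
        ...


class QuboleExecutor(Executor):
    def run_hive_query(self, query: str) -> str:
        # A real implementation would submit the query to the managed service.
        return f"[qubole] submitted: {query}"


class EmrExecutor(Executor):
    def run_hive_query(self, query: str) -> str:
        # A real implementation would submit the query to an EMR cluster.
        return f"[emr] submitted: {query}"


EXECUTORS = {"qubole": QuboleExecutor, "emr": EmrExecutor}


def make_executor(platform: str) -> Executor:
    """Pick the backend by name, for example from a job's configuration."""
    return EXECUTORS[platform]()


def daily_report(executor: Executor) -> str:
    # Job code is written once against the interface.
    return executor.run_hive_query("SELECT COUNT(*) FROM pins")


qubole_result = daily_report(make_executor("qubole"))
emr_result = daily_report(make_executor("emr"))
```

With this shape, switching providers is a configuration change rather than a rewrite of every job.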
Why Qubole?
• API for simplified executor abstraction
• Advanced support for spot instances
• Baked AMI customization
• Hadoop & Spark as managed services
• Tight integration with Hive
• Graceful cluster scaling
Pinball for Workflow Management
Scale of Processing
● Scale:
o 60 Billion Pins
o Hundreds of workflows
o Thousands of jobs
o 500+ jobs in a workflow
o 3 petabytes processed daily
● Support:
o Hadoop, Cascading, Hive, Spark …
[Diagram labels: job, workflow]
Why Pinball?
● Requirements
o Simple abstractions
o Extensible in future
o Reliable stateless computing
o Easy to debug
o Scales horizontally
o Can be upgraded w/o aborting workflows
o Rich features like auto-retries, per-job emails, overrun policies…
● Options
o Apache Oozie, Azkaban, Luigi
Pinball Design
Master
Worker
Scheduler
Command Line Clients
UI
Workflow Model
● Workflow
o A directed graph of nodes called jobs
● Edge
o Run-after dependence
● Node
o Job is a node
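The workflow model above (jobs as nodes, run-after dependencies as edges) can be sketched with Python's standard `graphlib` module (Python 3.9+). This is not Pinball's own code; the job names are hypothetical and the example only shows how run-after edges determine a valid execution order.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs it must run after. Names are made up.
workflow = {
    "parse_logs": set(),
    "build_features": {"parse_logs"},
    "train_model": {"build_features"},
    "daily_report": {"parse_logs"},
}

# static_order() yields every job after all of its dependencies.
order = list(TopologicalSorter(workflow).static_order())
```

A cycle in the graph would make the workflow impossible to schedule; `TopologicalSorter` raises `graphlib.CycleError` in that case.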
Job State
● Job state is captured in a token
● Tokens are named hierarchically
Master
Job Token
version: 123
name: /workflow/w1/job
owner: worker_0
expiration: 1234567
data: JobTemplate(....)
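A small sketch of the token shown on this slide, assuming a plain dataclass with the same five fields. The `Token` class and the `query_prefix` helper are illustrative, not Pinball's actual types; the sketch shows how hierarchical names let a caller select all tokens of one workflow by prefix.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    name: str                         # hierarchical, e.g. /workflow/w1/job
    version: int
    owner: Optional[str] = None
    expiration: Optional[int] = None
    data: Optional[str] = None        # payload; left opaque in this sketch


# Hypothetical tokens for two workflows.
tokens = {
    t.name: t
    for t in [
        Token("/workflow/w1/job_a", 123, "worker_0", 1234567, "JobTemplate(....)"),
        Token("/workflow/w1/job_b", 7),
        Token("/workflow/w2/job_a", 1),
    ]
}


def query_prefix(prefix):
    """Return the names of all tokens under a hierarchical prefix."""
    return sorted(name for name in tokens if name.startswith(prefix))


w1_jobs = query_prefix("/workflow/w1/")
```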
Job State Machine
[Diagram states: RUNNABLE, RUNNING, WAITING]
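The slide names three states but the extracted text does not preserve the arrows between them. The sketch below therefore assumes one plausible cycle (WAITING to RUNNABLE once dependencies are met, RUNNABLE to RUNNING when a worker claims the job, RUNNING back to WAITING when the run ends). Treat the transition table as an illustration, not as Pinball's documented behavior.

```python
from enum import Enum


class JobState(Enum):
    WAITING = "WAITING"
    RUNNABLE = "RUNNABLE"
    RUNNING = "RUNNING"


# Assumed transitions; the original diagram's arrows are not available.
ALLOWED = {
    JobState.WAITING: {JobState.RUNNABLE},
    JobState.RUNNABLE: {JobState.RUNNING},
    JobState.RUNNING: {JobState.WAITING},
}


def transition(current, target):
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


state = JobState.WAITING
state = transition(state, JobState.RUNNABLE)
state = transition(state, JobState.RUNNING)
```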
Master Worker Interaction
● Master keeps the state
● Workers claim and execute tasks
● Horizontally scalable
[Diagram: Worker, Master, Persistent Store; 1: request, 2: update, 3: ack]
Master
● Entire state is kept in memory
● Each state update is synchronously persisted before master replies to client
● Master runs on a single thread, so there are no concurrency issues
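The request, update, ack sequence and the master properties listed here can be sketched as follows. This is a simplified, single-threaded model with hypothetical names (`DurableLog`, `Master.claim`, `lease_sec`); the lease-style use of `owner` and `expiration` is an assumption based on the token fields, not a description of Pinball's real protocol.

```python
import json


class DurableLog:
    """Hypothetical stand-in for the persistent store; keeps writes in order."""

    def __init__(self):
        self.records = []

    def write(self, record):
        self.records.append(json.dumps(record, sort_keys=True))


class Master:
    def __init__(self, store):
        self._store = store
        self._tokens = {}  # entire state is held in memory

    def add_token(self, name):
        token = {"name": name, "version": 1, "owner": None, "expiration": 0}
        self._store.write(token)
        self._tokens[name] = token

    def claim(self, name, worker, now, lease_sec):
        """1: request. Returns the claimed token, or None if it is still owned."""
        token = self._tokens[name]
        if token["owner"] is not None and token["expiration"] > now:
            return None
        updated = dict(token, owner=worker, expiration=now + lease_sec,
                       version=token["version"] + 1)
        self._store.write(updated)    # 2: update, persisted before replying
        self._tokens[name] = updated
        return dict(updated)          # 3: ack


log = DurableLog()
master = Master(log)
master.add_token("/workflow/w1/job")

first = master.claim("/workflow/w1/job", "worker_0", now=100, lease_sec=60)
second = master.claim("/workflow/w1/job", "worker_1", now=120, lease_sec=60)  # lease still valid
third = master.claim("/workflow/w1/job", "worker_1", now=200, lease_sec=60)   # lease expired
```

Because the write to the store happens before the reply, anything a worker has been told is already durable, and a restarted master could rebuild its in-memory state from the store.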
Worker
Open Source
Git repo:
https://github.com/pinterest/pinball
Mailing list:
https://groups.google.com/forum/#!forum/pinball-users
Thank You

Big Data Platform at Pinterest