Confidential
50 Billion pins and counting: Using Hadoop to build data driven Products
Krishna Gade
What is Pinterest?
A visual bookmarking tool
Discover an inspiring idea
Save it to a board
Go do it
Krishna Gade
• Data Engineering at Pinterest
• Search and Data platforms at Twitter and Bing
• Follow @krishnagade
Who am I?
Pinterest is a data product
Why do we care about data?
How is Hadoop helping us to harness the power of the data?
What are some of the tools we built on top of the Hadoop platform?
Why do we care about data?
3.375
5’10”
< uncertainty
> odds of making the best decisions
It is a capital mistake to theorize before one has data.
- Sherlock Holmes
How is Hadoop helping us to harness the power of the data?
Data at Pinterest
• 50 Billion Pins
• 1 Billion boards
• 40 PB of data on S3
• 3 PB processed every day
• 2000 node Hadoop cluster
• 200 engineers
Pinterest Data Architecture
[Diagram components: App, events, Singer, Kafka, Secor, Qubole (Hadoop), Skyline, Pinball, Redshift, Pinalytics, Features]
• Ephemeral clusters
• Access control layer
• Shared data store
• Easy deployment
Hadoop Platform Requirements
• Isolated multi-tenancy
• Elasticity
• Support multiple clusters
Design Choices
Decoupling compute & storage
Hadoop Cluster 1: Transient HDFS
Hadoop Cluster 2: Transient HDFS
S3: Persistent Store
Centralized Hive Metastore
Pig, Cascading, Hive
Metadata: Hive Metastore
Data: HDFS/S3
Multi-layered Packaging
• Runtime Staging (on S3): MapReduce jobs, Hadoop jars/libs, job/user-level configs
• Automated Configuration (Masterless Puppet): software packages/libs, configs (OS/Hadoop), misc sys admin
• Baked AMI: OS, bootstrap script, core SW
Executor Abstraction Layer
Hive Metastore
HDFS/S3
Qubole Managed Hadoop
EMR
Executor
Pinball
Dev Server
• API for simplified executor abstraction
• Advanced support for spot instances
• Baked AMI customization
Why Qubole?
• Hadoop & Spark as managed services
• Tight integration with Hive
• Graceful cluster scaling
● Scale:
o 50 Billion Pins
o Hundreds of workflows
o Thousands of jobs
o 500+ jobs in a workflow
o 3 petabytes processed daily
● Support:
o Hadoop, Cascading, Hive, Spark …
Scale of Processing
job
workflow
Pinball
Why Pinball?
● Requirements
o Simple abstractions
o Extensible in future
o Reliable stateless computing
o Easy to debug
o Scales horizontally
o Can be upgraded w/o aborting workflows
o Rich features like auto-retries, per-job emails, overrun policies…
● Options
o Apache Oozie, Azkaban, Luigi
Pinball Design
● Workflow
o A directed graph of nodes called jobs
● Edge
o Run-after dependency
● Node
o A job is a node
Workflow Model
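The workflow model above can be sketched as a small directed graph in which every edge means "run after". This is only an illustration, not Pinball's actual API: the job names and the `run_order` helper are hypothetical, and Python's standard `graphlib` module stands in for the real scheduler.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it must run after.
workflow = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"transform"},
}


def run_order(dependencies):
    """Return one valid execution order that respects every run-after edge."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(workflow)
```

Any order returned places a job after all of the jobs it depends on; jobs with no edge between them (here `load` and `report`) may appear in either order.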
Job State
● Job state is captured in a token
● Tokens are named hierarchically
Master
Job Token
version: 123
name: /workflow/w1/job
owner: worker_0
expiration: 1234567
data: JobTemplate(....)
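A minimal sketch of the token idea, assuming the five fields shown on the slide. The `Token` class, the `tokens_under` and `claim` helpers, and the lease semantics of `expiration` are assumptions made for illustration; they are not taken from Pinball's code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    """Hypothetical mirror of the token fields shown on the slide."""
    version: int
    name: str                          # hierarchical, e.g. /workflow/w1/job
    owner: Optional[str] = None
    expiration: Optional[int] = None   # assumed: time at which a claim lapses
    data: Optional[str] = None


def tokens_under(tokens, prefix):
    """Select tokens whose hierarchical name falls under the given prefix."""
    prefix = prefix.rstrip("/") + "/"
    return [t for t in tokens if t.name.startswith(prefix)]


def claim(token, worker, now, lease_seconds):
    """Claim a token if it is unowned or its previous claim has expired."""
    still_owned = (
        token.owner is not None
        and token.expiration is not None
        and token.expiration > now
    )
    if still_owned:
        return False
    token.owner = worker
    token.expiration = now + lease_seconds
    token.version += 1                 # every change bumps the version
    return True
```

Hierarchical names make it cheap to address everything that belongs to one workflow with a single prefix, which is the property the sketch relies on in `tokens_under`.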
Job State Machine
● Master keeps the state
● Workers claim and execute tasks
● Horizontally scalable
Master Worker Interaction
Worker Master Persistent Store
1: request 2: update
3: ack
Master
● Entire state is kept in memory
● Each state update is synchronously persisted before master replies to client
● Master runs on a single thread: no concurrency issues
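The three master properties above (state in memory, synchronous persistence before replying, a single thread) can be sketched as follows. This is a toy under stated assumptions: the `Master` class, its `update`, `get`, and `recover` methods, and the JSON-lines log file standing in for the persistent store are all hypothetical.

```python
import json
import os
import tempfile


class Master:
    """Toy master: state lives in memory, every update hits disk before the reply."""

    def __init__(self, log_path):
        self._state = {}             # token name -> token data, kept in memory
        self._log_path = log_path    # stand-in for the persistent store

    def update(self, name, data):
        # Persist first and force the write to disk ...
        with open(self._log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"name": name, "data": data}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # ... then apply the change in memory and only now reply to the caller.
        self._state[name] = data
        return "ack"

    def get(self, name):
        return self._state.get(name)

    @classmethod
    def recover(cls, log_path):
        """Rebuild the in-memory state by replaying the persisted log."""
        master = cls(log_path)
        if os.path.exists(log_path):
            with open(log_path, encoding="utf-8") as log:
                for line in log:
                    record = json.loads(line)
                    master._state[record["name"]] = record["data"]
        return master


path = os.path.join(tempfile.mkdtemp(), "master.log")
master = Master(path)
reply = master.update("/workflow/w1/job", {"owner": "worker_0"})
restored = Master.recover(path)
```

Because calls are handled one at a time in this sketch, no locking is needed, and a restarted master can rebuild exactly the state that clients were told about.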
Worker
Open Source
Git repo:
https://github.com/pinterest/pinball
Mailing list:
https://groups.google.com/forum/#!forum/pinball-users
Data Driven Products
Guided Search
Related Pins
What are some of the tools we built on top of the Hadoop platform?
Scalable Data Analytics Engine
Pinalytics
Architecture
1. Backend: Thrift services and HBase databases
2. Webapp: Rich UI components
3. Reporter: Generates formatted data
4. Metrics: Customized optimizations
Main Components
Visualizations
• Highcharts
• Time-series updated automatically daily
Customizability
• Dashboards
• Built-in or user-defined reports
User Interface
Pinomaly
• Anomalous metric tracking
• Email alerts
Reporting
• Formatted dashboards
• PDF printing
• Duplicated weekly
Metric Manipulation
• Metric Composer
• Global operations (segmentation, rollup/aggregation, etc.)
User Interface
Date, seg1, seg2, ... => value
• Store the value for every possible segmentation
• On-the-fly aggregation
E.g.
• 2015-01-01, US, Male => 1
• 2015-01-01, US, Female => 2
• 2015-01-01, UK, Male => 3
• 2015-01-01, UK, Female => 4
• 2015-01-01, UK, * => 7
• 2015-01-01, *, Male => 4
Data Model
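The rollup rows in the example above (`UK, * => 7` and `*, Male => 4`) can be reproduced with a short sketch. It assumes an additive metric, so a rollup is a plain sum of the leaf rows; distinct-count metrics would need a different computation. The `rollups` helper and the in-memory dict are hypothetical stand-ins for the real pipeline and storage.

```python
from collections import defaultdict
from itertools import product

WILDCARD = "*"


def rollups(rows):
    """Sum leaf values into every wildcard combination of their segments."""
    totals = defaultdict(int)
    for (date, *segments), value in rows.items():
        # For each segment, choose either the concrete value or the wildcard.
        for combo in product(*[(seg, WILDCARD) for seg in segments]):
            totals[(date, *combo)] += value
    return dict(totals)


# Leaf rows taken from the slide's example.
leaf_rows = {
    ("2015-01-01", "US", "Male"): 1,
    ("2015-01-01", "US", "Female"): 2,
    ("2015-01-01", "UK", "Male"): 3,
    ("2015-01-01", "UK", "Female"): 4,
}
table = rollups(leaf_rows)
```

With two segments each leaf row contributes to four keys, so the table grows quickly as segments are added; that is the trade-off between storing every segmentation and aggregating on the fly.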
Backend Architecture
Components: Webapp Server, Pinalytics Thrift Service, HBase (Region Servers 1 to N hosting Regions 1 to M of the Metric table, each region with a coprocessor, CP)
Request flow:
1. request
2. readMetrics()
3. Scan & Aggregate
4. Region aggregation
5. metrics
Horizontal Scalability
• No app-level sharding
Flexibility in Aggregation
• FuzzyRowFilter
• Coprocessor
Tables
• Report metadata
• Reports
HBase
Composite row key
• METRIC|TIME|SEG1|SEG2|...
Filters rows given a row key and a fuzzy mask
• 0: the byte must match, 1: the byte can be anything
E.g. MAU of male users on 2015-01-01
• Start row: MAU|2015-01-01|
• End row: MAU|2015-01-01||
• Row Key: MAU|2015-01-01|--|M-
• Fuzzy filter: 000|0000000000|11|00
Fuzzy Row Filter
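The matching rule from the example above can be sketched in plain Python. This is not the HBase API: `fuzzy_match` and `scan` are hypothetical helpers, the sketch only compares keys of equal length, and it treats the `|` separators as fixed positions. The row keys in `keys` are invented for illustration.

```python
def fuzzy_match(row_key, pattern, mask):
    """True if row_key equals pattern at every position where mask is '0'."""
    if not (len(row_key) == len(pattern) == len(mask)):
        return False  # simplification: only equal-length keys are compared
    return all(m == "1" or r == p for r, p, m in zip(row_key, pattern, mask))


def scan(row_keys, start_row, end_row, pattern, mask):
    """Keep keys inside [start_row, end_row) that also pass the fuzzy match."""
    return [
        key
        for key in sorted(row_keys)
        if start_row <= key < end_row and fuzzy_match(key, pattern, mask)
    ]


# Values from the slide; separators become fixed ('0') positions in the mask.
PATTERN = "MAU|2015-01-01|--|M-"
MASK = "000|0000000000|11|00".replace("|", "0")

# Hypothetical row keys.
keys = [
    "MAU|2015-01-01|US|M-",
    "MAU|2015-01-01|UK|M-",
    "MAU|2015-01-01|US|F-",
    "MAU|2015-01-02|US|M-",
    "WAU|2015-01-01|US|M-",
]
matches = scan(keys, "MAU|2015-01-01|", "MAU|2015-01-01||", PATTERN, MASK)
```

The start and end rows narrow the scan to one metric and one day, and the mask then lets the country bytes vary while the gender bytes stay fixed.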
• Region-local aggregation with coprocessor
• Final aggregation at the Thrift service
• Reduces Network I/O
• Low Latency
HBase Coprocessor
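The two-stage aggregation described above can be sketched without HBase. Each dict below stands in for one region, `region_aggregate` plays the role of the coprocessor, and `service_aggregate` plays the role of the Thrift service; all names and data are hypothetical.

```python
def region_aggregate(region_rows, predicate):
    """Runs 'inside' one region: sum matching values, return a single number."""
    return sum(value for key, value in region_rows.items() if predicate(key))


def service_aggregate(regions, predicate):
    """Final aggregation at the service: combine one partial sum per region."""
    partials = [region_aggregate(rows, predicate) for rows in regions]
    return sum(partials), len(partials)


def is_male_mau(key):
    """Hypothetical row filter: MAU rows for male users."""
    return key.startswith("MAU|") and key.endswith("|M-")


# Hypothetical regions, each holding a slice of the metric table.
regions = [
    {"MAU|2015-01-01|US|M-": 5, "MAU|2015-01-01|US|F-": 7},
    {"MAU|2015-01-01|UK|M-": 3},
    {"WAU|2015-01-01|US|M-": 11},
]
total, partial_count = service_aggregate(regions, is_male_mau)
```

Only one number per region crosses the network in this sketch, instead of every matching row, which is where the reduction in network I/O comes from.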
Flexible Python client library for generating reports
• Arbitrary metrics and segments
Easy-to-access data
• Data is automatically copied to S3
• Hive external table is generated
Reporter
WAU, WARC and MAU segmented by gender and country
class DemoWAUReport(PinalyticsWideReport):
    # Metrics to report and the segment keys to break them down by.
    _METRIC_NAMES = ['wau', 'warc', 'mau']
    _SEGKEY_NAMES = ['gender', 'country']
    # Query that yields one row per (dt, gender, country) combination.
    _QUERY_TEMPLATE = """
        SELECT dt, gender, country, wau, warc, mau
        FROM activity_metrics WHERE dt >= '2015-01-01';"""
• Sample query output
['2015-01-01', 'male', 'US', 102, 53, 110]
Reporter Example
• Pre-compute a lot of core metrics
• Standard segmentation
- Gender, Country, App
- Spam-filtering
Core Metrics
• Activity
• Event counts
• Retention
• Signups
Outcomes
Internal Tools Matter
Solving problems inside of our company
400 Unique users
800 Page views per day
1500 Custom charts created and updated daily
Thank You
