@r39132
Big Data, Fast Data @ PayPal
Sid Anand (@r39132)
YOW! Conferences (Sydney, Brisbane, Melbourne)
Nov-Dec 2018
A Data Infrastructure Story
About Me
Worked @
Committer & PPMC on
Father of 2
Co-Chair @
Work @
Let’s talk scale!
@Scale: Last Year
PayPal by the Numbers! (Full year 2017 numbers)
• Markets: 200+
• Currencies: 100+
• Active Customer Accounts: 227M
• Payments Transactions: 7.8B
• Applications: 2,700
• Engineers: 4,500
• Releases: 17,000
• Servers: 200,000
• Power: 27 Megawatts
• Storage: 238 Petabytes
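Putting our data scale in perspective …
PayPal by the Numbers!
[Figure: our data volume expressed as DVDs, 7x the height of Mt Everest; two further comparisons are labeled "x 500,000" and "x 2,000,000", with the compared objects shown only as images]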
And we continue to see growth in all areas…
PayPal by the Numbers!
And to keep up with this growth, we’ve had to scale our data infrastructure
PayPal by the Numbers!
OLTP DBs
• 2,000+ Database Instances
• ~116 Billion Calls/day
• ~74 PB Total Storage
Kafka (Messaging)
• 400+ Billion Messages/day
• ~7 PB Total Storage
• 50+ Clusters
• 3K+ Topics
Hadoop (Analytics)
• 200,000+ Jobs/day
• 32 Hadoop Clusters
• 250+ PB Storage
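To get a feel for these daily totals, it helps to convert them to average per-second rates. The sketch below does only that arithmetic on the figures quoted on the slide; it assumes load is spread evenly over the day, which real traffic is not, so peak rates will be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(events_per_day: float) -> float:
    """Average rate per second for a daily total (an even spread is assumed)."""
    return events_per_day / SECONDS_PER_DAY


# Daily totals as quoted on the slide (lower bounds where the slide says "+").
db_calls_per_sec = per_second(116e9)        # ~116 Billion OLTP DB calls/day
kafka_msgs_per_sec = per_second(400e9)      # 400+ Billion Kafka messages/day
hadoop_jobs_per_sec = per_second(200_000)   # 200,000+ Hadoop jobs/day

print(f"OLTP DB calls/sec : {db_calls_per_sec:,.0f}")
print(f"Kafka messages/sec: {kafka_msgs_per_sec:,.0f}")
print(f"Hadoop jobs/sec   : {hadoop_jobs_per_sec:.1f}")
```

On average that is roughly 1.3 million DB calls and 4.6 million Kafka messages every second.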
Interlude …
Why we love Ozzies!
• Oz has ~25MM people
• Ozzies Eligible for PayPal: ~19MM people
• Ozzies with Active Accounts: ~7MM
• @ 37%, it’s PayPal’s most penetrated market!!
• PayPal
Setting the Context
To understand PayPal’s Data Infrastructure today, scale is only half the story!
Its Data Infrastructure has evolved as new technologies were created and as requirements changed
PayPal is a 20-year-old company!
Building A Modern Website
A Data Infrastructure Evolution Story
Building a Modern Day Web Site
[Diagram, built up over several slides: a DB behind a CName; then Load Balancers; then Search; then CDC Indexing (DB to Search); then a Media Store behind its own CName]
[Diagram adds: Ad-hoc, Reporting, DP/ML]
Analytics Use-cases
1. Reporting (Nightly)
• Well-defined columns
2. Ad-hoc Analysis (throughout Day)
• Fast reads, any column
3. Data Processing / ML training
• Large scans & writes
Impedance Mismatch
Serving needs (OLTP DBs)
• Fast reads & writes
• Well-defined workloads
• Simple queries
Analytic (Ad-hoc) needs (OLAP DBs)
• Fast reads
• Unknown workloads
• Complex (exploratory) queries
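The two query shapes behind this mismatch can be sketched with Python's built-in sqlite3 module. The payments table and its rows are hypothetical, and a single in-memory SQLite database stands in for both kinds of system purely to show the difference in workload: a serving query touches one row by a known key, while an analytic query scans and aggregates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments ("
    "id INTEGER PRIMARY KEY, customer TEXT, country TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO payments (id, customer, country, amount) VALUES (?, ?, ?, ?)",
    [
        (1, "alice", "AU", 20.0),
        (2, "bob", "US", 35.5),
        (3, "carol", "AU", 12.5),
        (4, "alice", "AU", 7.0),
    ],
)

# Serving (OLTP) shape: a simple, well-defined lookup of one row by primary key.
serving_row = conn.execute(
    "SELECT customer, amount FROM payments WHERE id = ?", (2,)
).fetchone()

# Analytic (OLAP) shape: an exploratory scan over every row with grouping.
analytic_rows = conn.execute(
    "SELECT country, COUNT(*), SUM(amount) FROM payments "
    "GROUP BY country ORDER BY country"
).fetchall()

print(serving_row)
print(analytic_rows)
```

At scale, the first shape favors row-oriented stores with indexes on known keys, while the second favors systems built for large scans, which is the motivation for separating the two.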
[Diagram adds: an Ad-hoc DB fed from the DB by ETL/ELT, serving Ad-hoc, Reporting, and DP/ML]
[Diagram adds: a Scheduler]
A workflow scheduler needs to coordinate the nightly/hourly loads!
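At its core, coordinating loads means running tasks in dependency order. A minimal sketch using Python's standard graphlib module follows; the task names are hypothetical, and a production scheduler such as Apache Airflow (listed later in the talk) adds timing, retries, and monitoring on top of this idea.

```python
from graphlib import TopologicalSorter

# Hypothetical nightly load: each task maps to the set of tasks it depends on.
nightly_load = {
    "extract_from_oltp_db": set(),
    "load_into_adhoc_db": {"extract_from_oltp_db"},
    "build_nightly_reports": {"load_into_adhoc_db"},
    "train_ml_model": {"load_into_adhoc_db"},
}


def run_order(dag: dict) -> list:
    """Return an order in which every task comes after its dependencies."""
    return list(TopologicalSorter(dag).static_order())


order = run_order(nightly_load)
for step, task in enumerate(order, start=1):
    print(step, task)
```

Reports and model training are independent of each other, so their relative order is not fixed; a real scheduler could run them in parallel.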
[Diagram adds: HDFS clusters for DP/ML, with Reporting remaining on the Ad-hoc DB]
Increasingly, ad-hoc exploratory queries are also being moved to the data lake to keep costs down!
[Diagram adds: Kafka and an HDFS Ingest step]
What about App engagement metrics & other business metric events?
• The web apps log business events to Kafka
• A Kafka consumer ingests these events into HDFS, where they can be aggregated & possibly also used in ML features
[Diagram adds: Graph Ingest, Graph Processing, and Graph DBs]
We live in a connected world.
• We can infer a lot from what goes on around us in our connected neighborhood.
• Graph Processing
• Graph DBs
[Diagram adds: a Cache]
And who can forget about caches?
[Diagram adds: RT OLAP]
And RT OLAP engines like Apache Druid or LinkedIn’s Pinot!
• A specialty data system optimized for time-oriented roll-ups
[Final diagram, labeled "Modern Data Infrastructure": DB, Load Balancers, Search, CDC Indexing, Media Store, Ad-hoc DB, ETL/ELT, Reporting, DP/ML, Ad-hoc, HDFS, Scheduler, Kafka, HDFS Ingest, Graph Ingest, Graph Processing, Graph DBs, Cache, RT OLAP]
Data Infrastructure Domain, with Specialty Data Systems: Examples
Online Serving
• OLTP DBs (NoSQL, NewSQL, RDBMS): MySQL, Postgres, FoundationDB
• Caches: Redis, Memcached
• Search Engines: Elasticsearch, Solr
• Graph Engines: JanusGraph, AWS Neptune, TigerGraph
• Media Stores (Object, Filers): AWS S3, LinkedIn Ambry
• RT OLAP engines: LinkedIn Pinot, Apache Druid
Offline Analytics
• OLAP (MPP) DBs: Teradata, AWS Redshift, BigQuery
• Graph Processing: GraphX
• Large Scale Data Processing: Pig, Spark, M/R
• SQL-on-Hadoop: Presto, Impala, KSQL
• Stream Processing: Spark, Flink, Beam, Storm
• ML Platforms: MLflow, Kubeflow
• BI tools (Reporting): Tableau, MicroStrategy
Data Movement
• Streams: Kafka
• Workflow Schedulers: Apache Airflow, UC4, Control-M
• Ingesters (Graph, Search, Hadoop, ETL/ELT): Sqoop, LinkedIn Gobblin, Informatica
Key Take-aways
• Common pitfall!
• When your primary OLTP data store is struggling under load, your first reaction may be to
• Scale it out! Or
• Replace it with a hot new technology
• Better approach
• Analyze the workloads & potentially
• Move different workloads to different systems
• Hire specialty talent to manage those systems
• Separate those systems by well-defined interfaces & protocols
This is Microservices & Conway’s law applied to Data Engineering
PayPal Data Architecture
An Overview
PayPal’s (Core) Architecture (Simplified)
[Diagram labels: PHX, SLC, CName, Load Balancer, OCC, DB, DB (RO), GG]
• 2 Customer-Serving Data Centers today (PHX, SLC), more on the way
• Mobile & Web App traffic that hits paypal.com is Akamai-routed to one of these 2 Data Centers
• Within a Data Center, we have multiple Availability Zones (AZs). A routing layer within the Data Center will route to one of the Availability Zones
• Each AZ is composed of many microservices as well as other services, such as Kafka clusters, etc.
• DB requests are made to a single “Horizontal” AZ that contains all of the Core DBs (Oracle RACs)
• OCC = Oracle Connection Cache
[Diagram adds: a third data center, LVS, containing Teradata and Hadoop]
• PP has one Analytics Data Center in Las Vegas!
• We have 2 major data store types in our Analytics Data Center: Teradata and Hadoop
• While Reporting is primarily from Teradata, the other use cases (Ad-hoc, DP/ML) can hit either store
• Custom pipelines feed both Teradata & Hadoop from our Site DBs
[Diagram labels for the pipelines: DB (Pump), GG, OIS, CDH-R, Core Data Highway, Informatica (ETL/ELT)]
[Diagram adds: a Scheduler, then Steam Donkey between Teradata and Hadoop]
• We have 3 schedulers today for Batch Job execution
• Our home-grown Steam Donkey transfers data between Teradata & Hadoop
The remainder of this talk will focus on the highlighted components:
• Fast Data (CDH)
• Big Data (Hadoop & More)
Fast Data in Action
Let’s look at a use-case
Say I want to send my wife money!
After specifying an amount & a message, I hit Send
I see a confirmation page
And I see the transfer in my activity feed!
[Diagram: the flow is split into a Synchronous part and an Asynchronous part, connected by DB-to-DB Synchronization]
Once the customer sees the confirmation screen, she can rest assured that a commit has completed to the TXN database!
[Diagram labels (SLC): DB, DB (Pump), GG, CDH Replicat, Avro Schema Registry (register, get), Kafka, Router]
• Oracle Golden Gate (GG) reads the Redo log into its proprietary trail file format & streams it to the CDH Replicat
• The Replicat reads the trail file, record by record
• Extracts the db schema of each row, converts it into an Avro schema, and registers that with the Avro Schema Registry (ASR)
• Composes an Avro message
• Sends the message to Kafka
• A Storm Router gets the Writer’s schema id from the message header
• Contacts the ASR to download the schema by id, if not in a local cache
• Decodes the datum using the Writer’s schema
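The register, get, and cache steps above can be sketched in plain Python. Everything here is a stand-in rather than PayPal's implementation: the registry is an in-memory class, the class and function names are hypothetical, and JSON replaces Avro's binary encoding so that the example runs without extra libraries.

```python
import json


class SchemaRegistry:
    """In-memory stand-in for the Avro Schema Registry (ASR)."""

    def __init__(self):
        self._schema_by_id = {}
        self._id_by_schema = {}
        self.get_calls = 0  # counts lookups so cache hits are visible

    def register(self, schema: dict) -> int:
        """Return the id for this schema, assigning one on first sight."""
        key = json.dumps(schema, sort_keys=True)
        if key not in self._id_by_schema:
            schema_id = len(self._id_by_schema) + 1
            self._id_by_schema[key] = schema_id
            self._schema_by_id[schema_id] = schema
        return self._id_by_schema[key]

    def get(self, schema_id: int) -> dict:
        self.get_calls += 1
        return self._schema_by_id[schema_id]


def produce(registry: SchemaRegistry, schema: dict, record: dict) -> dict:
    """Writer side: register the schema and carry its id in the header."""
    schema_id = registry.register(schema)
    # JSON stands in for Avro's binary encoding (assumption for this sketch).
    return {"header": {"schema_id": schema_id},
            "payload": json.dumps(record).encode()}


class Router:
    """Reader side: resolve the writer's schema by id, using a local cache."""

    def __init__(self, registry: SchemaRegistry):
        self._registry = registry
        self._cache = {}

    def decode(self, message: dict) -> dict:
        schema_id = message["header"]["schema_id"]
        schema = self._cache.get(schema_id)
        if schema is None:  # only contact the registry on a cache miss
            schema = self._registry.get(schema_id)
            self._cache[schema_id] = schema
        raw = json.loads(message["payload"])
        # Keep exactly the fields named by the writer's schema.
        return {f["name"]: raw[f["name"]] for f in schema["fields"]}


registry = SchemaRegistry()
txn_schema = {"type": "record", "name": "Txn",
              "fields": [{"name": "id", "type": "string"},
                         {"name": "amount", "type": "string"}]}
router = Router(registry)
first = produce(registry, txn_schema, {"id": "1", "amount": "10.00"})
second = produce(registry, txn_schema, {"id": "2", "amount": "5.50"})
decoded = [router.decode(first), router.decode(second)]
print(decoded, "registry lookups:", registry.get_calls)
```

Carrying only a schema id in each message keeps messages small, and the local cache means the registry is contacted once per schema rather than once per message.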
• Hydrates the message from Oracle to get all columns (not just CDC columns)! (Read full record by PK)
• Generates N output messages, one per destination, masking sensitive columns by destination
• Sends N messages to Kafka
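The fan-out step can be illustrated with a small function that builds one output message per destination and masks the columns that destination may not see. The topic names, column names, and mask value are hypothetical; the talk does not say how masking is configured.

```python
MASK = "****"  # hypothetical mask value


def fan_out(record: dict, destinations: dict) -> dict:
    """Build one output message per destination topic.

    `destinations` maps a topic name to the set of columns that must be
    masked for that topic.
    """
    outputs = {}
    for topic, masked_columns in destinations.items():
        outputs[topic] = {
            column: (MASK if column in masked_columns else value)
            for column, value in record.items()
        }
    return outputs


# A fully hydrated record (all columns, values as strings).
record = {
    "txn_id": "42",
    "amount": "25.00",
    "email": "a@example.com",
    "card_number": "4111111111111111",
}
destinations = {
    "activity-feed": {"card_number"},
    "analytics": {"card_number", "email"},
}
messages = fan_out(record, destinations)
for topic, message in messages.items():
    print(topic, message)
```

Masking per destination lets one source stream serve consumers with different data-access rights without each consumer seeing the raw record.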
[Diagram adds: a second Kafka cluster after the Router, and the Activity Services DB]
• The Activity Services consumer app follows the same steps previously mentioned to decode the Avro message
• It does some transformation to the data before storing it in its own DB
• When you visit the Activity mobile or web app, your data is retrieved from the Activity Services DB!
Activity Streams -- by the Numbers!
• Scale: hundreds of millions of events / day
• Latency (99%ile): < 60s
• Correctness: 100%
Take-Aways: Change Data Capture
Why Change Data Capture?
Many ways to sync two or more databases (DB-to-DB Synchronization):
• XA Transactions
• Event Sourcing
• Change Data Capture
XA Transactions (a.k.a. 2-Phase Commits)
• Problem: Giving up Availability for Consistency (CAP Theorem)
Event Sourcing
[Diagram: writes (W) pass through Kafka on their way to the DBs]
• Problem: Giving up Read-Your-Write Consistency
Change Data Capture
• Solution: Guaranteed eventual consistency with low latency
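The eventual-consistency property of Change Data Capture can be shown with a toy model: the source commits first and appends each change to an ordered log, and a replica catches up later by replaying that log from its last position. The dictionaries and function names below are hypothetical stand-ins for the databases and the redo log.

```python
source = {}       # stand-in for the source (TXN) database
change_log = []   # stand-in for the redo log, in commit order


def commit(op: str, key: int, row: dict = None) -> None:
    """Apply a change to the source and append it to the change log."""
    if op == "delete":
        source.pop(key, None)
    else:  # "insert" or "update"
        source[key] = row
    change_log.append((op, key, row))


def apply_changes(replica: dict, log: list, applied_upto: int) -> int:
    """Replay log entries after `applied_upto`; return the new position."""
    for position in range(applied_upto, len(log)):
        op, key, row = log[position]
        if op == "delete":
            replica.pop(key, None)
        else:
            replica[key] = row
    return len(log)


commit("insert", 1, {"amount": "10.00"})
commit("update", 1, {"amount": "12.00"})
commit("insert", 2, {"amount": "5.00"})
commit("delete", 2)

replica = {}                      # lags the source until changes are applied
position = apply_changes(replica, change_log, 0)
print(replica == source, position)
```

Writers read their own writes from the source immediately, and the replica converges once it has replayed the log, which is the trade-off the slide describes.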
Take-Aways: Apache Avro
Why is Avro Needed?
The Data Contract between Reader & Writer is enforced by the DB via a table Schema
[Diagram: Kafka takes the place of the DB between Writer and Reader]
Avro
• Is an efficient self-describing (schema’d) data serialization format
• Supports Schema evolution
• Has good support in most languages
• Is widely accepted in the Big & Fast data space
• Is used for data interchange across both streams and files (HDFS)
Fast Data Architecture
The Control Plane
[Diagram, CDH Data Plane (SLC): DB, DB (Pump), GG, CDH Replicat, Kafka, Router, Kafka, ASR (register, get), Read full record by PK]
Some More Requirements
• We have ~60K tables in our Oracle databases
• We can’t just turn on 60K streams as it would be wasteful, especially if no one needs to consume them!
• We have 4500+ engineers in PayPal & 6 engineers on the CDH dev team
• How do we enable anyone in the company to launch any stream?
• If we did eventually have 60K+ streams, how would we manage them?
[Diagram adds: a CDH Self-Service Control Plane with a Metadata DB, alongside the CDH Data Plane]
• A PP Engineer visits the CDH self-service portal (a ReactJS app) to provision a data pipeline
• He or she submits a request for a new pipeline
• The provision request is recorded in the metadata db
• A periodic Airflow job kicks off to call an API on the Squbs server to execute long-running pipeline provisioning tasks
• This task creates new Kafka topics, GG Replicat processes, and Storm topologies!
• Within a minute a new pipeline is flowing!
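The split between recording a request and acting on it can be sketched as two small pieces: a metadata store that captures intent, and a periodic job that provisions whatever is still pending. The class, status values, and resource names are hypothetical; the real system uses a portal, Airflow, and a Squbs server as described above.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataDB:
    """Stand-in for the metadata db that records provisioning requests."""
    requests: list = field(default_factory=list)

    def record_request(self, table: str, requester: str) -> int:
        """Intent capture: store the request and return its index."""
        self.requests.append(
            {"table": table, "requester": requester, "status": "REQUESTED"}
        )
        return len(self.requests) - 1


def provision(request: dict) -> list:
    """Hypothetical provisioning task: name the resources it would create."""
    table = request["table"]
    return [f"kafka-topic:{table}",
            f"gg-replicat:{table}",
            f"storm-topology:{table}"]


def orchestrate_once(db: MetadataDB) -> list:
    """One run of the periodic job: provision every pending request."""
    created = []
    for request in db.requests:
        if request["status"] == "REQUESTED":
            created.extend(provision(request))
            request["status"] = "ACTIVE"
    return created


db = MetadataDB()
db.record_request("PAYMENT_TXN", requester="engineer@example.com")
first_run = orchestrate_once(db)
second_run = orchestrate_once(db)   # nothing pending, so nothing is created
print(first_run, second_run)
```

Because the job works from recorded state rather than from the original request, a failed or interrupted run can simply be retried on the next cycle.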
Design Principles
1. System built from OSS components & runs on containers (HA)!
2. Separation of Concerns:
• Intent Capture vs Orchestration
3. Orchestration is the brains of the control plane!
• DP Self-healing
• DP Auto-scaling
• Fault-tolerant actions
• Maintenance-aware
Fast Data Requirements
Data Plane Requirements
• Correctness – 0% data loss/corruption
  • Causes of data loss/corruption are typically
    • Deployments of Buggy Code
    • Data corner-cases – latent bugs not related to recent code changes but to data outliers
• Latency – 99%ile < 1 minute (rain or shine)
  • Definition of Latency SLA Misses
    • Data is arriving, but it is delayed
  • Causes of latency SLA Misses
    • Scalability bottlenecks
    • Performance bottlenecks
• Availability – Always Available
  • Definition of Availability SLA Misses
    • No data is arriving
  • Causes of availability loss are typically
    • Deployments of Buggy Code
    • SPOF outages
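A latency requirement stated as "99%ile < 1 minute" can be checked mechanically. The sketch below uses the nearest-rank definition of a percentile, which is one common choice and an assumption here, on made-up latency samples.

```python
def percentile(samples: list, pct: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100   # integer ceil(pct/100 * n)
    return ordered[max(rank, 1) - 1]


def meets_latency_sla(latencies_s: list, limit_s: float = 60.0) -> bool:
    """True when the 99th percentile latency is under the limit."""
    return percentile(latencies_s, 99) < limit_s


# Made-up end-to-end latencies in seconds.
mostly_fast = [2.0] * 990 + [75.0] * 10   # exactly 1% of events are slow
too_slow = [2.0] * 980 + [75.0] * 20      # 2% of events are slow

print(meets_latency_sla(mostly_fast))
print(meets_latency_sla(too_slow))
```

The first sample set meets the target even though some events took 75 seconds, because a 99th percentile target tolerates up to 1% of slow events.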
Fast Data Challenges
Solutions!
1. Performance Bottlenecks
2. Data Corner Cases
3. Deployments of Buggy Code
Performance Bottlenecks
[Diagram: the CDH Data Plane, including the Router's "Read full record by PK" query back to the DB]
• Hydration Queries
• The biggest bottleneck is the hydration query back to the source DB for updated rows
• Hydration queries can take 20-40 ms vs 500 microseconds
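The impact of those latencies is easiest to see as a throughput ceiling. The sketch below assumes a single worker that handles rows strictly one at a time, and it reads the 500 microsecond figure as the per-row cost without a hydration query; the slide does not state either assumption.

```python
def max_rows_per_second(latency_seconds: float) -> float:
    """Throughput ceiling for one worker processing rows sequentially."""
    return 1.0 / latency_seconds


with_hydration_worst = max_rows_per_second(0.040)   # 40 ms per row
with_hydration_best = max_rows_per_second(0.020)    # 20 ms per row
without_hydration = max_rows_per_second(0.0005)     # 500 microseconds per row

slowdown_best = 0.020 / 0.0005
slowdown_worst = 0.040 / 0.0005

print(f"with hydration   : {with_hydration_worst:.0f}-{with_hydration_best:.0f} rows/sec")
print(f"without hydration: {without_hydration:.0f} rows/sec")
print(f"slowdown         : {slowdown_best:.0f}x-{slowdown_worst:.0f}x")
```

Under these assumptions each worker drops from about 2,000 rows per second to 25-50, a 40x to 80x slowdown.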
• Solution: Oracle GG Full-Supplemental Logging! No more hydration!
Data Corner Cases
• Considerations
1. A latent bug can be triggered when it encounters unexpected data!
• Approach
• We do 0 type conversions!
• Oracle Golden Gate provides everything as String
• Due to the Number type, which does not map to any numeric type in Avro or any programming language, we had to abandon end-to-end type safety
• The upside is that we don’t run into type-related conversion issues caused by data corner cases!
• We don’t replicate LOB fields
• Currently, we have no transformation logic in our pipelines!
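A short illustration of why a pipeline might prefer to pass numbers through as strings: converting a high-precision value to a fixed-width numeric type can silently change it, while the string survives untouched. The value below is made up, and the use of Python's Decimal on the consumer side is only one option a consumer might choose, not something the talk prescribes.

```python
from decimal import Decimal

# Made-up value with 29 significant digits, as it might arrive in text form.
oracle_number_as_text = "12345678901234567890.123456789"

# A lossy conversion: a 64-bit float keeps only about 15-17 significant digits.
as_float = float(oracle_number_as_text)
round_tripped = format(as_float, "f")

# The pass-through approach: the pipeline never converts the value.
passed_through = oracle_number_as_text

# A consumer that needs arithmetic can parse the string exactly if it chooses.
as_decimal = Decimal(passed_through)

print("original      :", oracle_number_as_text)
print("via float     :", round_tripped)
print("passed through:", passed_through)
print("exact decimal :", as_decimal)
```

The float round trip changes the digits, which is the kind of data corner case that a no-conversion pipeline cannot hit.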
Code Deployments
1. Set maintenance mode (pausing all orchestration actions)
2. Stop Storm topology
3. Backup checkpoint
4. Deploy new code
5. Start Storm topology
6. Monitor for errors. If errors detected,
- a. stop topology
- b. rollback checkpoint
- c. rollback code version
- d. restart topology
7. Set maintenance mode off (unpausing all orchestration actions)
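The deployment steps translate naturally into a procedure with a rollback path. The Pipeline class, its fields, and the health-check callback below are hypothetical stand-ins for the Storm topology, its checkpoint, and the monitoring step; comments map each line to the numbered steps.

```python
class Pipeline:
    """Hypothetical stand-in for one pipeline's topology and checkpoint."""

    def __init__(self, version: str, checkpoint: int):
        self.version = version
        self.checkpoint = checkpoint
        self.running = True
        self.maintenance_mode = False


def deploy(pipeline: Pipeline, new_version: str, is_healthy) -> bool:
    """Deploy new_version; return True if it was kept, False if rolled back."""
    pipeline.maintenance_mode = True             # 1. pause orchestration
    try:
        pipeline.running = False                 # 2. stop topology
        saved_checkpoint = pipeline.checkpoint   # 3. backup checkpoint
        old_version = pipeline.version
        pipeline.version = new_version           # 4. deploy new code
        pipeline.running = True                  # 5. start topology
        if is_healthy(pipeline):                 # 6. monitor for errors
            return True
        pipeline.running = False                 # 6a. stop topology
        pipeline.checkpoint = saved_checkpoint   # 6b. rollback checkpoint
        pipeline.version = old_version           # 6c. rollback code version
        pipeline.running = True                  # 6d. restart topology
        return False
    finally:
        pipeline.maintenance_mode = False        # 7. unpause orchestration


def failing_check(pipeline: Pipeline) -> bool:
    """Simulate a bad release that advances the checkpoint, then errors."""
    pipeline.checkpoint += 5
    return False


pipeline = Pipeline(version="1.0", checkpoint=100)
kept_bad = deploy(pipeline, "1.1-buggy", failing_check)
state_after_rollback = (pipeline.version, pipeline.checkpoint, pipeline.running)
kept_good = deploy(pipeline, "1.1", lambda p: True)
print(kept_bad, state_after_rollback, kept_good, pipeline.version)
```

Restoring the checkpoint along with the code means events processed by the buggy version are processed again by the old version, which supports the correctness goal.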
Fast Data Stats!
Data Plane
• GA’d in August 2018
• 2.2 TB streamed / day
• 300+ Pipelines activated through our self-service portal!
Closing Thoughts
• Favor microservice approaches to building data architectures
• When possible (almost always), favor OSS data projects over proprietary ones
• In stream processing, #NO_OPS is the only way to meet SLAs
• Check out our OSS Data Projects on http://paypal.github.io/
Acknowledgments
• Akara Sucharitakul
• Anil Gursel
• Doron Mimon
• Na Yang
• Maulin Vasavada
• Kevin Lu
• Prasanna Krishna
• Sri Shivananda
• Kamlakar Singh
• Nagendra Rai
• Swroop Singh
• Anoj Rawat
• Rahul Srivastava
• Naitra Muralykrishnan
• Prabhu Kasinathan
• Vincent Chen
• Anisha Nainani
• Pramod Garre
• Harsh Bhimani
• Nirmalya Ghosh
• Yash Shah
• Aastha Sinha
• Deepak Mohanakumar Chandramouli
• Romit Mehta
• Dheeraj Rampally
• Stalin Subbiah
• Ashwin Nellore
• Lohit Giri
• Plamen Jeliazhov
• Sehmuz Bayhan
And Many More…
Questions?

Big Data, Fast Data @ PayPal (YOW 2018)

  • 1. @r39132 Big Data, Fast Data @ PayPal Sid Anand (@r39132) YOW! Conferences (Sydney, Brisbane, Melbourne) Nov-Dec 2018 A Data Infrastructure Story
  • 2. @r39132 About Me Worked @ Committer & PPMC on Father of 2 Co-Chair @ Work @
  • 4. @r39132 @Scale: Last Year 200+ 100+ Markets Currencies 227M Active Customer Accounts 7.8B Payments Transactions 2,700 Applications 4,500 Engineers 17,000 Releases 200,000 Servers 27 Megawatts Power 238 Petabytes Storage Full year 2017 numbers PayPal by the Numbers!
  • 5. @r39132 Putting our data scale in perspective … PayPal by the Numbers! DVDs7x height of Mt Everest x 500,000 x 2, 000,000
  • 6. @r39132 And we continue to see growth in all areas… PayPal by the Numbers!
  • 7. @r39132 And to keep up with this growth, we’ve had to scale our data infrastructure PayPal by the Numbers! 2,000 + Database Instances ~116 Billion Calls/day ~74 PB Total Storage OLTP DBs Kafka Messaging Hadoop Analytics
  • 8. @r39132 And to keep up with this growth, we’ve had to scale our data infrastructure PayPal by the Numbers! 2,000 + Database Instances ~116 Billion Calls/day ~74 PB Total Storage OLTP DBs Kafka Messaging 400+ Billion Messages/day ~7 PB Total Storage 50 + Clusters 3K + Topics Hadoop Analytics
  • 9. @r39132 And to keep up with this growth, we’ve had to scale our data infrastructure PayPal by the Numbers! 2,000 + Database Instances ~116 Billion Calls/day ~74 PB Total Storage OLTP DBs Kafka Messaging 400+ Billion Messages/day ~7 PB Total Storage 50 + Clusters 3K + Topics 200,000 + Jobs/day 32 Hadoop Clusters 250+ PB Storage Hadoop Analytics
  • 10. @r39132 Interlude … Why we love Ozzies! • Oz has ~25MM people • Ozzies Eligible for PayPal: ~19MM people • Ozzies with Active Accounts: ~7MM • @ 37%, it’s PayPal’s most penetrated market!! • PayPal
  • 11. @r39132 Setting the Context To understand PayPal’s Data Infrastructure today, scale is only half the story! It’s Data Infrastructure has evolved based on the creation of new technologies as well as changing requirements PayPal is a 20 year old company!
  • 12. @r39132 Building A Modern Website A Data Infrastructure Evolution Story
  • 13. @r39132 Building a Modern Day Web Site DB CName
  • 14. @r39132 DB Load Balancer CName Load Balancer Load Balancer Load Balancer Building a Modern Day Web Site
  • 15. @r39132 DB Load Balancer CName Load Balancer Load Balancer Load Balancer Search Building a Modern Day Web Site
  • 16. @r39132 DB Load Balancer CName Load Balancer Load Balancer Load Balancer Search CDC Indexing Building a Modern Day Web Site
  • 17. @r39132 DB Load Balancer CName Load Balancer Load Balancer Load Balancer Search CDC Indexing Media Store CName Building a Modern Day Web Site
  • 18. @r39132 DB CName Load Balancer Load Balancer Load Balancer Search CDC Indexing CName Media Store Ad-hoc Reporting DP/ML Analytics Use-cases 1. Reporting (Nightly) • Well-defined columns 2. Ad-hoc Analysis (throughout Day) • Fast reads, any column 3. Data Processing / ML training • Large scans & writes Building a Modern Day Web Site
  • 19. @r39132 DB CName Load Balancer Load Balancer Load Balancer Search CDC Indexing CName Impedance Mismatch • Serving needs • Fast reads & writes • Well-defined workloads • Simple queries • Analytic (Ad-hoc) needs • Fast reads • Unknown workloads • Complex (exploratory) queries Media Store Ad-hoc Reporting DP/ML Building a Modern Day Web Site
  • 20. @r39132 DB CName Load Balancer Load Balancer Load Balancer Search CDC Indexing CName Impedance Mismatch • Serving needs • Fast reads & writes • Well-defined workloads • Simple queries • OLTP DBs • Analytic (Ad-hoc) needs • Fast reads • Unknown workloads • Complex (exploratory) queries • OLAP DBs Media Store Ad-hoc Reporting DP/ML Building a Modern Day Web Site
  • 21. @r39132 DB CName Load Balancer Load Balancer Load Balancer Search CDC Indexing CName Ad-hoc DB ETL/ELT Reporting DP/ML Analytics Use-cases 1. Reporting (Nightly) • Well-defined columns 2. Ad-hoc Analysis (throughout Day) • Fast reads, any column 3. Data Processing / ML training • Large scans & writes Media Store Building a Modern Day Web Site
  • 22. @r39132 DB CName Load Balancer Load Balancer Load Balancer Search CDC Indexing CName Ad-hoc DB ETL/ELT Reporting DP/ML Scheduler A workflow scheduler needs to coordinate the nightly/hourly loads! Media Store Scheduler Building a Modern Day Web Site
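As a rough illustration of what "coordinating the loads" involves, the sketch below orders a set of jobs so that each one runs only after the jobs it depends on. The job names and the dependency graph are hypothetical, not PayPal's; a production scheduler such as Apache Airflow (named later in this deck) also handles calendars, retries and alerting.

```python
from graphlib import TopologicalSorter

# Hypothetical nightly jobs: each key maps to the set of jobs it depends on.
NIGHTLY_JOBS = {
    "extract_orders": set(),
    "extract_customers": set(),
    "load_warehouse": {"extract_orders", "extract_customers"},
    "build_reports": {"load_warehouse"},
}


def run_order(jobs):
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(jobs).static_order())


order = run_order(NIGHTLY_JOBS)
print(order)
```

Both extract jobs come out before `load_warehouse`, and `build_reports` comes last; the relative order of the two extracts is not fixed by the graph.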
  • 23. @r39132 DB CName Load Balancer Load Balancer Load Balancer Search CDC Indexing CName Ad-hoc DB ETL/ELT Reporting Analytics Use-cases 1. Reporting (Nightly) • Well-defined columns 2. Ad-hoc Analysis (throughout Day) • Fast reads, any column 3. Data Processing / ML training • Large scans & writes Media Store DP/ML HDFS HDFS HDFS Scheduler Building a Modern Day Web Site
  • 24. @r39132 DB CName Load Balancer Load Balancer Load Balancer Search CDC Indexing CName Ad-hoc DB ETL/ELT Reporting Media Store DP/ML HDFS HDFS HDFS Scheduler Ad-hoc Increasingly, ad-hoc exploratory queries are also being moved to the data lake to keep costs down! Building a Modern Day Web Site
  • 25. @r39132 DB CName Load Balancer Load Balancer Load Balancer Search CDC Indexing CName Ad-hoc DB ETL/ELT Reporting Media Store DP/ML HDFS HDFS HDFS Scheduler Kafka HDFS Ingest Ad-hoc What about app engagement metrics & other business metric events? • The web apps log business events to Kafka • A Kafka consumer ingests these events into HDFS where they can be aggregated & possibly also used in ML features Building a Modern Day Web Site
  • 26. @r39132 DB CName Load Balancer Load Balancer Load Balancer Search CDC Indexing CName Ad-hoc DB ETL/ELT Reporting We live in a connected world. • We can infer a lot from what goes on around us in our connected neighborhood. • Graph Processing • Graph DBs Media Store DP/ML HDFS HDFS HDFS Scheduler Kafka HDFS Ingest Ad-hoc Graph Processing Graph DBs Graph Ingest Building a Modern Day Web Site
  • 27. @r39132 DB CName Load Balancer Load Balancer Load Balancer Search CDC Indexing CName Ad-hoc DB ETL/ELT Reporting Media Store DP/ML HDFS HDFS HDFS Scheduler Kafka HDFS Ingest Ad-hoc Graph Processing Graph DBs Graph Ingest Cache And who can forget about caches? Building a Modern Day Web Site
  • 28. @r39132 DB CName Load Balancer Load Balancer Load Balancer Search CDC Indexing CName Ad-hoc DB ETL/ELT Reporting Media Store DP/ML HDFS HDFS HDFS Scheduler Kafka HDFS Ingest Ad-hoc Graph Processing Graph DBs Graph Ingest Cache And RT OLAP engines like Apache Druid or LinkedIn’s Pinot! A specialty data system optimized for time-oriented roll-ups RT OLAP Building a Modern Day Web Site
  • 29. @r39132 DB CName Load Balancer Load Balancer Load Balancer Search CDC Indexing CName Ad-hoc DB ETL/ELT Reporting Media Store DP/ML HDFS HDFS HDFS Scheduler Kafka HDFS Ingest Ad-hoc Graph Processing Graph DBs Graph Ingest Cache RT OLAP Modern Data Infrastructure Building a Modern Day Web Site
  • 30. @r39132 Building a Modern Day Web Site (columns: Data Infrastructure Domain, Specialty Data Systems, Examples)
      Online Serving
        • OLTP DBs (NoSQL, NewSQL, RDBMS): MySQL, Postgres, FoundationDB
        • Caches: Redis, Memcached
        • Search Engines: Elasticsearch, SOLR
        • Graph Engines: JanusGraph, AWS Neptune, TigerGraph
        • Media Stores (Object, Filers): AWS S3, LinkedIn Ambry
        • RT OLAP engines: LinkedIn Pinot, Apache Druid
      Offline Analytics
        • OLAP (MPP) DBs: Teradata, AWS Redshift, Big Query
        • Graph Processing: GraphX
        • Large Scale Data Processing: Pig, Spark, M/R
        • SQL-on-Hadoop: Presto, Impala, KSQL
        • Stream Processing: Spark, Flink, Beam, Storm
        • ML Platforms: MLFlow, Kubeflow
        • BI tools (Reporting): Tableau, Microstrategy
      Data Movement
        • Streams: Kafka
        • Workflow Schedulers: Apache Airflow, UC4, Control-M
        • Ingesters (Graph, Search, Hadoop, ETL/ELT): Sqoop, LinkedIn Gobblin, Informatica
  • 31. @r39132 Key Take-aways • Common pitfall! • When your primary OLTP data store is struggling under load, your first reaction may be to • Scale it out! Or • Replace it with a hot new technology
  • 32. @r39132 Key Take-aways • Better approach • Analyze the workloads & potentially • Move different workloads to different systems • Hire specialty talent to manage those systems • Separate those systems by well-defined interfaces & protocols
  • 33. @r39132 Key Take-aways This is Microservices & Conway’s law applied to Data Engineering
  • 35. @r39132 PayPal’s (Core) Architecture (Simplified) PHX SLC 2 Customer-Serving Data Centers today, more on the way
  • 36. @r39132 PayPal’s (Core) Architecture (Simplified) PHX SLC CName Mobile & Web App traffic that hits paypal.com is Akamai-routed to one of these 2 Data Centers
  • 37. @r39132 PayPal’s (Core) Architecture (Simplified) PHX SLC CName Load Balancer Load Balancer Within a Data Center, we have multiple Availability Zones. A routing layer within the Data Center will route to one of the Availability Zones. Each AZ is composed of many microservices as well as other services, such as Kafka clusters, etc…
  • 38. @r39132 PayPal’s (Core) Architecture (Simplified) PHX SLC CName Within a Data Center, we have multiple Availability Zones. Load Balancer Load Balancer A routing layer within the Data Center will route to one of the Availability Zones. Each AZ is composed of many microservices as well as other services, such as Kafka clusters, etc… OCC OCC DB (RO) DB DB requests are made to a single “Horizontal” AZ that contains all of the Core DBs (Oracle RACs). OCC = Oracle Connection Cache GG
  • 39. @r39132 PHX PayPal’s (Core) Architecture (Simplified) CName Load Balancer Load Balancer = OCC OCC DB SLC LVS PP has one Analytics Data Center in Las Vegas!
  • 40. @r39132 PHX PayPal’s (Core) Architecture (Simplified) CName Load Balancer Load Balancer = OCC OCC DB SLC LVS We have 2 major data store types in our Analytics Data Center: • Teradata • Hadoop Hadoop Teradata
  • 41. @r39132 PHX PayPal’s (Core) Architecture (Simplified) CName Load Balancer Load Balancer = OCC OCC DB SLC LVS While Reporting is primarily from Teradata, the other use cases can hit either store Hadoop Teradata Reporting DP/ML Ad-hoc
  • 42. @r39132 PHX PayPal’s (Core) Architecture (Simplified) CName Load Balancer Load Balancer = OCC OCC DB SLC LVS Custom pipelines feed both Teradata & Hadoop from our Site DBs Hadoop Teradata Reporting DP/ML Ad-hoc DB (Pump) OIS CDH-R Informatica (ETL/ELT) Core Data Highway GG GG GG
  • 43. @r39132 PHX PayPal’s (Core) Architecture (Simplified) CName Load Balancer Load Balancer = OCC OCC DB SLC LVS Hadoop Teradata Reporting DP/ML Ad-hoc DB (Pump) OIS CDH-R Core Data Highway GG GG GG We have 3 schedulers today for Batch Job execution Scheduler Informatica (ETL/ELT)
  • 44. @r39132 PHX PayPal’s (Core) Architecture (Simplified) CName Load Balancer Load Balancer = OCC OCC DB SLC LVS Our home-grown Steam Donkey transfers data between Teradata & Hadoop Hadoop Teradata Reporting DP/ML Ad-hoc DB (Pump) OIS CDH-R Core Data Highway GG Steam Donkey GG GG Scheduler Informatica (ETL/ELT)
  • 45. @r39132 PHX PayPal’s (Core) Architecture (Simplified) CName Load Balancer Load Balancer = OCC OCC DB SLC LVS The remainder of this talk will focus on the highlighted components: • Fast Data (CDH) • Big Data (Hadoop & More) Hadoop Teradata Reporting DP/ML Ad-hoc DB (Pump) OIS CDH-R Core Data Highway GG GG GG Scheduler Steam Donkey Informatica (ETL/ELT)
  • 46. @r39132 Fast Data in Action Let’s look at a use-case
  • 47. @r39132 Fast Data in Action Say I want to send my wife money!
  • 48. @r39132 Fast Data in Action After specifying an amount & a message, I hit Send
  • 49. @r39132 Fast Data in Action I see a confirmation page
  • 50. @r39132 Fast Data in Action And I see the transfer in my activity feed!
  • 51. @r39132 Fast Data in Action Asynchronous Synchronous
  • 52. @r39132 Fast Data in Action Asynchronous Synchronous DB DB Synchronization
  • 53. @r39132 Fast Data in Action DB SLC Once the customer sees the confirmation screen, she can rest assured that a commit has completed to the TXN database!
  • 54. @r39132 Fast Data in Action DB DB (Pump) CDH Replicat GG GG SLC • Oracle Golden Gate reads the Redo log into its proprietary trail file format & streams it to the CDH Replicat
  • 55. @r39132 Fast Data in Action DB DB (Pump) CDH Replicat GG GG Avro Schema Registry register SLC • The Replicat reads the trail file, record by record • Extracts the db schema of each row, converts it into an Avro schema, and registers that with the Avro Schema Registry (ASR)
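The register step can be pictured with a small sketch. Everything below is an assumption made for illustration: the table and column names are hypothetical, the registry is an in-memory stand-in rather than the real ASR, and every column is typed as a nullable string (the deck notes later that Golden Gate delivers values as strings; the nullability is a guess).

```python
import json


def table_to_avro_schema(table, columns):
    """Build an Avro record schema (as a dict) with each column as a nullable string."""
    return {
        "type": "record",
        "name": table,
        "fields": [
            {"name": col, "type": ["null", "string"], "default": None}
            for col in columns
        ],
    }


class InMemorySchemaRegistry:
    """Hypothetical stand-in for the Avro Schema Registry (ASR)."""

    def __init__(self):
        self._ids = {}      # canonical schema JSON -> schema id
        self._schemas = {}  # schema id -> schema dict

    def register(self, schema):
        """Return the id for this schema, assigning a new id only the first time."""
        key = json.dumps(schema, sort_keys=True)
        if key not in self._ids:
            new_id = len(self._ids) + 1
            self._ids[key] = new_id
            self._schemas[new_id] = schema
        return self._ids[key]

    def get(self, schema_id):
        return self._schemas[schema_id]


registry = InMemorySchemaRegistry()
first_id = registry.register(table_to_avro_schema("TXN", ["TXN_ID", "AMOUNT"]))
second_id = registry.register(table_to_avro_schema("TXN", ["TXN_ID", "AMOUNT"]))
```

Registering the same schema twice returns the same id, so a writer can call `register` for every record without growing the registry.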
  • 56. @r39132 Fast Data in Action DB DB (Pump) CDH Replicat GG GG Avro Schema Registry SLC • Composes an Avro message • Sends the message to Kafka Kafka
  • 57. @r39132 Fast Data in Action DB DB (Pump) CDH Replicat GG GG Router Kafka Avro Schema Registry get SLC • A Storm Router gets the Writer’s schema id from the message header • Contacts the ASR to download the schema by id, if not in a local cache • Decodes the datum using the Writer’s schema
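The lookup described on this slide (read the schema id from the header, consult a local cache, fall back to the registry) might look roughly like the sketch below. The message layout, the registry contents and the class name are hypothetical, and the payload is a plain list of values standing in for Avro binary data, which a real consumer would hand to an Avro library.

```python
class CachingSchemaClient:
    """Fetch a schema by id, calling the registry only on a cache miss."""

    def __init__(self, fetch_schema):
        self._fetch_schema = fetch_schema  # callable: schema_id -> schema (hypothetical ASR lookup)
        self._cache = {}
        self.remote_calls = 0

    def get(self, schema_id):
        if schema_id not in self._cache:
            self.remote_calls += 1
            self._cache[schema_id] = self._fetch_schema(schema_id)
        return self._cache[schema_id]


def decode(message, schemas):
    """Decode a message using the writer's schema named in its header."""
    schema = schemas.get(message["header"]["schema_id"])
    names = [field["name"] for field in schema["fields"]]
    return dict(zip(names, message["payload"]))


# Hypothetical registry holding a single schema with id 7.
REGISTRY = {7: {"name": "TXN", "fields": [{"name": "TXN_ID"}, {"name": "AMOUNT"}]}}

client = CachingSchemaClient(REGISTRY.__getitem__)
msg = {"header": {"schema_id": 7}, "payload": ["42", "10.00"]}
first = decode(msg, client)
second = decode(msg, client)  # served from the local cache
```

After both calls `client.remote_calls` is 1: only the first decode reaches the registry.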
  • 58. @r39132 Fast Data in Action DB DB (Pump) CDH Replicat GG GG Router Kafka Avro Schema Registry SLC • Hydrates the message from Oracle to get all columns (not just CDC columns)! Read full record by PK
  • 59. @r39132 Fast Data in Action DB DB (Pump) CDH Replicat GG GG Router Kafka Avro Schema Registry SLC • Generates N output messages, one per destination, masking sensitive columns by destination • Sends N messages Read full record by PK Kafka
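A minimal sketch of the fan-out step, assuming a per-destination list of sensitive columns. The destination names, column names and the `****` mask are hypothetical; the deck does not say how masking is configured or what the masked value looks like.

```python
MASK = "****"


def fan_out(record, destinations):
    """Return one copy of `record` per destination, masking that destination's sensitive columns."""
    outputs = {}
    for name, masked_columns in destinations.items():
        outputs[name] = {
            col: (MASK if col in masked_columns else value)
            for col, value in record.items()
        }
    return outputs


record = {"TXN_ID": "42", "AMOUNT": "10.00", "EMAIL": "a@example.com"}
destinations = {"activity": set(), "analytics": {"EMAIL"}}
messages = fan_out(record, destinations)
```

Here N is 2: the `activity` copy keeps every column, while the `analytics` copy has `EMAIL` masked.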
  • 60. @r39132 Fast Data in Action DB DB (Pump) CDH Replicat GG GG Router SLC Kafka Kafka Avro Schema Registry get • The Activity Services consumer app follows the same steps previously mentioned to decode the Avro message
  • 61. @r39132 Fast Data in Action DB DB (Pump) CDH Replicat GG GG Router SLC Kafka Kafka Avro Schema Registry • It does some transformation to the data before storing it in its own DB DB
  • 62. @r39132 Fast Data in Action DB DB (Pump) CDH Replicat GG GG Router SLC Kafka Kafka Avro Schema Registry • When you visit the Activity mobile or web app, your data is retrieved from the Activity Services DB! DB
  • 63. @r39132 Fast Data in Action DB DB (Pump) CDH Replicat GG GG Router SLC Kafka Avro Schema Registry Activity Streams -- by the Numbers! • Scale: hundreds of millions of events / day • Latency (99%ile): < 60s • Correctness: 100% DB Kafka
  • 65. @r39132 Why Change Data Capture? DB DB Synchronization Many ways to sync two or more databases: • XA Transactions • Event Sourcing • Change Data Capture
  • 66. @r39132 XA Transactions (a.k.a. 2-Phase Commits) DB DB Problem: • Giving up Availability for consistency (CAP Theorem)
  • 67. @r39132 Event Sourcing DB DB Problem: • Giving up Read-Your-Write Consistency W W W W W W W W Kafka
  • 68. @r39132 Change Data Capture DB DB Solution: • Guaranteed eventual consistency with low-latency
  • 70. @r39132 Why is Avro Needed? DB
  • 71. @r39132 Why is Avro Needed? DB The Data Contract between Reader & Writer is enforced by the DB via a table Schema
  • 72. @r39132 Why is Avro Needed? DB Kafka
  • 73. @r39132 Why is Avro Needed? Avro • Is an efficient self-describing (schema’d) data serialization format • Supports Schema evolution • Has good support in most languages • Is widely accepted in the Big & Fast data space • Is used for data interchange across both streams and files (HDFS) Kafka
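To make "supports schema evolution" concrete, here is a simplified pure-Python sketch of two of Avro's schema resolution rules: a field present only in the reader's schema takes its default, and a field present only in the writer's data is ignored. It works on already-decoded dicts with hypothetical field names and leaves out type promotion, aliases and unions, all of which a real Avro library handles.

```python
def resolve(writer_record, reader_fields):
    """Project a decoded record onto the reader's fields, Avro-style."""
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in writer_record:
            resolved[name] = writer_record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for reader field {name!r}")
    return resolved


# Data written with an older schema, read with a newer one.
old_record = {"TXN_ID": "42", "LEGACY_FLAG": "Y"}
new_reader_fields = [
    {"name": "TXN_ID"},
    {"name": "CURRENCY", "default": "USD"},  # added after the record was written
]
evolved = resolve(old_record, new_reader_fields)
```

The resolved record keeps `TXN_ID`, gains `CURRENCY` from its default and drops `LEGACY_FLAG`, so readers and writers can upgrade independently.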
  • 75. @r39132 Fast Data Architecture SLC DB DB (Pump) CDH Replicat GG GG Router Kafka Kafka ASR register get Read full record by PK CDH Data Plane
  • 76. @r39132 Fast Data Architecture Some More Requirements • We have ~60K tables in our Oracle databases • We can’t just turn on 60K streams as it would be wasteful, especially if no one needs to consume it! • We have 4500+ engineers in PayPal & 6 engineers on the CDH dev team • How do we enable anyone in the company to launch any stream? • If we did eventually have 60K+ streams, how would we manage them?
  • 77. @r39132 SLC DB DB (Pump) CDH Replicat GG GG Router Kafka Kafka ASR register get Read full record by PK CDH Self- Service Control Plane CDH Data Plane Metadata DB Fast Data Architecture
  • 78. @r39132 Fast Data Architecture SLC DB DB (Pump) GG GG ASR CDH Data Plane Metadata DB CDH Control Plane • PP Engineer visits the CDH self-service portal (a ReactJS app) to provision a data pipeline • He or she submits a request for a new pipeline • The provision request is recorded in the metadata db
  • 79. @r39132 SLC DB DB (Pump) GG GG ASR CDH Data Plane • A periodic Airflow job kicks off to call an API on the Squbs server to execute long-running pipeline provisioning tasks Fast Data Architecture Metadata DB CDH Control Plane
  • 80. @r39132 SLC Metadata DB CDH Control Plane DB DB (Pump) CDH Replicat GG GG Router Kafka Kafka ASR register get Read full record by PK CDH Data Plane • This task creates new Kafka topics, GG Replicat processes, and Storm topologies! • Within a minute a new pipeline is flowing! Fast Data Architecture
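The control-plane flow on the last three slides (record the request, then let a periodic job act on it) can be sketched as below. This is a toy model under stated assumptions: the in-memory `MetadataDB`, the state names and the resource labels are hypothetical, and the real system uses a metadata database, Airflow and a Squbs server instead.

```python
class MetadataDB:
    """Hypothetical in-memory metadata store holding pipeline requests."""

    def __init__(self):
        self.requests = []

    def record_request(self, table):
        """Intent capture: store what the engineer asked for, nothing more."""
        self.requests.append({"table": table, "state": "REQUESTED"})


def provision(request, created):
    """Stand-in for the long-running provisioning tasks."""
    for resource in ("kafka_topic", "gg_replicat", "storm_topology"):
        created.append((resource, request["table"]))


def orchestrate_once(db, created):
    """One pass of the periodic job: provision every request still pending."""
    for request in db.requests:
        if request["state"] == "REQUESTED":
            provision(request, created)
            request["state"] = "ACTIVE"


db = MetadataDB()
db.record_request("TXN")
created = []
orchestrate_once(db, created)
orchestrate_once(db, created)  # second pass is a no-op: the request is already ACTIVE
```

Keeping the request record separate from the provisioning pass means the portal only writes rows, and the periodic job can be rerun safely.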
  • 81. @r39132 Design Principles 1. System built from OSS components & runs on containers (HA)! 2. Separation of Concerns: • Intent Capture vs Orchestration 3. Orchestration is the brains of the control plane! • DP Self-healing • DP Auto-scaling • Fault-tolerant actions • Maintenance-aware Fast Data Architecture Metadata DB CDH Control Plane
  • 83. @r39132 Fast Data Requirements • Correctness – 0% data loss/corruption • Latency – 99%ile < 1 minute (rain or shine) • Availability – Always Available
  • 85. @r39132 Fast Data Requirements • Correctness – 0% data loss/corruption • Causes of data loss/corruption are typically • Deployments of Buggy Code • Data corner-cases – latent bugs not related to recent code changes but to data outliers • Latency – 99%ile < 1 minute (rain or shine) • Definition of Latency SLA Misses • Data is arriving, but it is delayed • Causes of latency SLA Misses • Scalability bottlenecks • Performance bottlenecks • Availability – Always Available • Definition of Availability SLA Misses • No data is arriving • Causes of availability loss are typically • Deployments of Buggy Code • SPOF outages
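The two SLA-miss definitions above can be turned into a small check. The sketch assumes a monitoring window is summarized as a list of end-to-end latencies in seconds and uses the nearest-rank method for the 99th percentile; the function names and the window representation are hypothetical, not the team's actual monitoring code.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def classify_window(latencies_s, sla_s=60.0):
    """Label one monitoring window using the definitions on the slide."""
    if not latencies_s:
        return "availability miss"  # no data is arriving
    if percentile(latencies_s, 99) >= sla_s:
        return "latency miss"       # data is arriving, but it is delayed
    return "ok"
```

With 100 samples, a single 120-second outlier still leaves the window "ok" because the 99th percentile is unaffected; two such outliers turn it into a latency miss.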
  • 87. @r39132 Fast Data Challenges 1. Performance Bottlenecks 2. Data Corner Cases 3. Deployments of Buggy Code
  • 88. @r39132 Performance Bottlenecks SLC Metadata DB CDH Control Plane DB DB (Pump) CDH Replicat GG GG Router Kafka Kafka ASR register get Read full record by PK CDH Data Plane • Hydration Queries • The biggest bottleneck is the hydration query back to the source DB for updated rows • Hydration queries can take 20-40 ms vs 500 microseconds
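A back-of-the-envelope calculation shows why this matters. It rests on two assumptions not stated on the slide: that 500 microseconds is the per-record cost without hydration, and that a worker handles records one at a time, so its throughput ceiling is the inverse of the per-record latency.

```python
def max_serial_throughput(latency_s):
    """Upper bound on records per second for one worker processing records serially."""
    return 1.0 / latency_s


CASES = [
    ("hydration, 20 ms", 0.020),
    ("hydration, 40 ms", 0.040),
    ("500 microseconds", 0.0005),
]

for label, latency in CASES:
    print(f"{label}: {max_serial_throughput(latency):,.0f} records/s per worker")
```

Under those assumptions a worker tops out at roughly 25 to 50 records per second with hydration versus about 2,000 without, a 40x to 80x difference.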
  • 89. @r39132 SLC Metadata DB CDH Control Plane DB DB (Pump) CDH Replicat GG GG Router Kafka Kafka ASR register get CDH Data Plane • Hydration Queries • Solution: Oracle GG Full-Supplemental Logging! No more hydration! Performance Bottlenecks
  • 90. @r39132 Fast Data Challenges 1. Performance Bottlenecks 2. Data Corner Cases 3. Deployments of Buggy Code
  • 91. @r39132 Data Corner Cases • Considerations 1. A latent bug can be triggered when it encounters unexpected data! • Approach • We do 0 type conversions! • Oracle Golden Gate provides everything as String • Due to the Number type, which does not map to any numeric type in Avro or any programming language, we had to abandon end-to-end type safety • The upside is that we don’t run into type-conversion issues related to data corner cases! • We don’t replicate LOB fields • Currently, we have no transformation logic in our pipelines!
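One way to see the Number problem: an Oracle NUMBER can carry up to 38 significant digits, which is more than a 64-bit integer (Avro `long`) or a 64-bit float (Avro `double`) can hold exactly. The sketch below uses a made-up 38-digit value to illustrate that point; it is not a description of the team's code, only of why passing the value through as a string avoids lossy conversion.

```python
from decimal import Decimal

# A hypothetical 38-digit value, the maximum precision of an Oracle NUMBER.
raw = "12345678901234567890123456789012345678"

INT64_MAX = 2**63 - 1  # largest value a 64-bit signed integer can hold

# Does the value fit in a 64-bit signed integer?
fits_in_long = int(raw) <= INT64_MAX

# Does it survive a round trip through a 64-bit IEEE 754 float?
survives_double = Decimal(float(raw)) == Decimal(raw)

# Does it survive being carried as text and parsed only by the consumer?
survives_string = str(Decimal(raw)) == raw

print(fits_in_long, survives_double, survives_string)
```

The value neither fits in a 64-bit integer nor survives a float round trip, while the string form is preserved digit for digit.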
  • 92. @r39132 Fast Data Challenges 1. Performance Bottlenecks 2. Data Corner Cases 3. Deployments of Buggy Code
  • 93. @r39132 Code Deployments SLC Metadata DB CDH Control Plane DB DB (Pump) CDH Replicat GG GG Router Kafka ASR register get CDH Data Plane 1. Set maintenance mode (pausing all orchestration actions)
  • 94. @r39132 Code Deployments SLC Metadata DB CDH Control Plane DB DB (Pump) CDH Replicat GG GG Router Kafka ASR register get CDH Data Plane 2. Stop Storm topology 3. Backup checkpoint 4. Deploy new code 5. Start Storm topology 6. Monitor for errors
  • 95. @r39132 Code Deployments SLC Metadata DB CDH Control Plane DB DB (Pump) CDH Replicat GG GG Router Kafka ASR register get CDH Data Plane 6. If errors are detected: a. stop topology b. roll back checkpoint c. roll back code version d. restart topology
  • 96. @r39132 Code Deployments SLC Metadata DB CDH Control Plane DB DB (Pump) CDH Replicat GG GG Router Kafka ASR register get CDH Data Plane 7. Set maintenance mode off (unpausing all orchestration actions)
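The deployment steps on the last four slides can be read as a single procedure with a rollback branch. The sketch below only records step names in a list; the `healthy` callback, the version strings and the use of `try`/`finally` to always lift maintenance mode are assumptions made for illustration, not the team's tooling.

```python
def deploy(new_version, old_version, healthy, log):
    """Run the deployment steps in order and roll back if the health check fails."""
    log.append("maintenance on")  # step 1: pause orchestration actions
    try:
        log.append("stop topology")
        log.append("backup checkpoint")
        log.append(f"deploy {new_version}")
        log.append("start topology")
        if healthy():  # step 6: monitor for errors
            return new_version
        # Errors detected: undo the deployment.
        log.append("stop topology")
        log.append("rollback checkpoint")
        log.append(f"rollback code to {old_version}")
        log.append("restart topology")
        return old_version
    finally:
        log.append("maintenance off")  # step 7: resume orchestration actions


good_log, bad_log = [], []
running_after_good = deploy("v2", "v1", lambda: True, good_log)
running_after_bad = deploy("v2", "v1", lambda: False, bad_log)
```

In both runs the last logged step is "maintenance off", so orchestration resumes whether or not the rollback branch was taken.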
  • 97. @r39132 Fast Data Stats! SLC Metadata DB CDH Control Plane DB DB (Pump) CDH Replicat GG GG Router Kafka Kafka ASR register get CDH Data Plane Data Plane • GA’d in August 2018 • 2.2 TB streamed / day • 300+ Pipelines activated through our self-service portal!
  • 98. @r39132 Closing Thoughts • Favor microservice approaches to building data architectures • When possible (almost always), favor OSS data projects over proprietary ones • In stream processing, #NO_OPS is the only way to meet SLAs • Check out our OSS Data Projects on http://paypal.github.io/
  • 99. @r39132 Acknowledgments • Akara Sucharitakul • Anil Gursel • Doron Mimon • Na Yang • Maulin Vasavada • Kevin Lu • Prasanna Krishna • Sri Shivananda • Kamlakar Singh • Nagendra Rai • Swroop Singh • Anoj Rawat • Rahul Srivastava • Naitra Muralykrishnan • Prabhu Kasinathan • Vincent Chen • Anisha Nainani • Pramod Garre • Harsh Bhimani • Nirmalya Ghosh • Yash Shah • Aastha Sinha • Deepak Mohanakumar Chandramouli • Romit Mehta • Dheeraj Rampally • Stalin Subbiah • Ashwin Nellore • Lohit Giri • Plamen Jeliazhov • Sehmuz Bayhan And Many More…