ULTIMATE JOURNEY TOWARDS REALTIME DATA PLATFORM: 2.5M/s
BORIS TROFIMOV @ SIGMA SOFTWARE
ABOUT ME
Leading DWH @ Oath
Major expertise: Big Data and Enterprise
Co-founder of Odessa JUG
Passionate follower of Scala
Associate professor at ONPU
WHERE IS BIG DATA?
INTRODUCING DATA PLATFORM
[Diagram: integration points (API, back office, customer, web portal, mobile apps)]
INTRODUCING DATA PLATFORM
[Diagram: core domain services (domain-service microservices) behind the integration points: API, back office, customer, web portal, mobile apps]
INTRODUCING DATA PLATFORM
[Diagram: infrastructure services (service discovery, shared config, domain dependency management, ACL management) added next to the core domain services and integration points]
INTRODUCING DATA PLATFORM
[Same diagram, annotated with the question: where does BIG DATA fit?]
INTRODUCING DATA PLATFORM
[Diagram: a DATA PLATFORM block is added alongside the core domain services and infrastructure services]
INTRODUCING DATA PLATFORM
[Diagram: the data platform receives events from platform components and 3rd-party providers and serves reporting and analytics. Major mission: organize data]
ZOOMING IN DATA PLATFORM
[Diagram: an ingestion module, validation/enrichment module and aggregations module feed a warehouse of raw data, facts and dimensions; a reporting service, analytics module, configuration module and dimension updater sit around it]
based on © https://blogs.msdn.microsoft.com/agile/2012/07/26/cqrs-journey-guidance-project-released/
OUR DOMAIN
[Diagram: core platform and data platform serving video players, content owners and end users]
OUR DOMAIN: DATA LAKE PROCESSING
S3 Data Lake: 5 PB
Vertica: 500 TB
Rows/Table: 600 B
Events/Sec: 2.5 M
Files/Hour/Pipeline: 15 K
Data/Day: 25 TB
ORIGINAL PIPELINE
[Diagram: NGINX → S3 → HADOOP → VERTICA → REPORTING SERVICE; data lag ~1h]
UNITED LAMBDA PLATFORM
[Diagram: batch path NGINX → S3 → HADOOP → VERTICA (data lag ~1h) and speed path NGINX → KAFKA → SPARK → MEMSQL (data lag ~2m), both feeding the REPORTING SERVICE]
WHAT WAS GOOD
DATA DELIVERY TIME: 2 min
FINE AT PROD SCALE AT THAT TIME (150K/s)
PAINFUL TO SCALE UP TO 1M/s
ONE DOES NOT SIMPLY
GROW 20X
WHY WE NEEDED CHANGES
ROCKY SCALING
Adding/removing CDH YARN nodes requires a YARN restart and downtime for apps
Tricky to build quick sandboxes
The latest MemSQL release (5.x) was not able to operate a cluster with > 80 nodes
Max supported rate was 1M events/s, while the business required 2.5M/s
ZERO TOLERANCE
Faulty EC2 nodes could make Spark or MemSQL get stuck for a while
Buggy HA: even one faulty node could break the entire MemSQL cluster, forcing us to recreate the database and lose data
PUSH approach to write data to MemSQL
MONITORING & ALERTING
Find the most relevant metrics
Eliminate FALSE POSITIVE and FALSE NEGATIVE errors
ON THE WAY TO 2.5 M/s
MIGRATING SPARK TO EMR
EASY TO CREATE, EASY TO DESTROY
• easy to … make the bill cost a fortune
MULTIPLE EMR CLUSTERS
• Separation of concerns and isolation
• Better to run a single application per EMR cluster
• Simplified auto-scaling rules
STATELESS EMR CLUSTERS
• Do not use local HDFS
CAUTION, EMR!
EASY TO ALLOCATE AND EASY TO LOSE AN EMR NODE
• Mostly concerns m4.4xl, the most popular instance type
LOSING THE MASTER NODE MEANS LOSING THE ENTIRE CLUSTER
• Hard to build a reliable platform spanning multiple AZs [see the fleets model]
• Develop a one-step evacuation procedure to another EMR cluster
LACK OF CAPACITY FOR A SPECIFIC INSTANCE TYPE
• Can be mitigated by the fleets model
DEPLOYMENT DETAILS
[Diagram, built up across several slides: an EMR cluster [YARN] with a MASTER node and TASK nodes; the YARN config is zipped and published to S3; the driver app runs in a Docker container (orchestrated by Rancher) together with the Spark binaries and a local copy of the YARN config, and submits work to the cluster. A submission sketch follows below.]
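The deployment flow above can be sketched with Spark's launcher API. This is a hypothetical sketch rather than the exact production setup: paths, the driver class and the S3 bucket are made up, and it assumes the EMR YARN config zip has already been fetched from S3 and unpacked so that HADOOP_CONF_DIR points at it inside the container.

```scala
// Hypothetical sketch: submitting the driver app to the remote EMR's YARN from inside
// a Docker container. Spark binaries are baked into the image; the local YARN config
// (downloaded from S3 and unpacked, e.g. to /etc/hadoop/conf) tells the launcher where
// the cluster is via HADOOP_CONF_DIR / YARN_CONF_DIR.
import org.apache.spark.launcher.SparkLauncher

object SubmitToEmr extends App {
  val handle = new SparkLauncher()
    .setSparkHome("/opt/spark")                                        // Spark binaries in the image
    .setAppResource("s3://example-bucket/jars/ingestion-assembly.jar") // made-up artifact location
    .setMainClass("com.example.ingestion.Main")                        // hypothetical driver class
    .setMaster("yarn")
    .setDeployMode("cluster")
    .setConf("spark.executor.instances", "20")
    .startApplication()

  // Block until YARN reports a final state.
  while (!handle.getState.isFinal) Thread.sleep(5000)
}
```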
CDH vs EMR
CDH: Cannot scale out/in on demand | EMR: Able to scale out/in on demand
CDH: No extra cost (for a community license) | EMR: Extra ~30% on top of EC2 costs, per-second billing (!)
CDH: Adding machines requires restarting YARN | EMR: No YARN restart
CDH: Easy configuration management via CM | EMR: Limited configuration available during EMR creation
CDH: Classic YARN cluster | EMR: Ordinary YARN under the hood, imposes an EMR-driven way to deploy apps
CDH: Single CDH per AZ | EMR: EMR cluster on demand as the unit of clustering
MAKING SPARK WRITE FASTER
USING A CUSTOM HADOOP COMMITTER
• FileOutputCommitter with the V2 algorithm to avoid the extra file-move phase in HDFS/S3
WRITE THE DATAFRAME TO HDFS FIRST
• Spark writes to HDFS directly into the partitioned folder and registers the new partition in Hive (see the sketch below)
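A minimal sketch of both points, with made-up table and path names: the v2 commit algorithm removes the extra rename/move phase, and the job writes straight into a batch_id partition folder before registering that partition in Hive.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ingestion")
  // v2 algorithm: task output is committed in place, no second "move files" phase
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// Stand-in for the enriched events dataframe built upstream.
val df = Seq((123L, "impression"), (456L, "click")).toDF("user_id", "event_type")

val batchId = System.currentTimeMillis() / 60000 * 60000   // minute-aligned batch_id
val path    = s"hdfs:///data/raw_facts/batch_id=$batchId"  // hypothetical layout

df.write.mode("overwrite").orc(path)                        // write directly into the partition folder

// Register the partition so Hive/Presto see the new data immediately.
spark.sql(s"ALTER TABLE raw_facts ADD IF NOT EXISTS PARTITION (batch_id=$batchId) LOCATION '$path'")
```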
WRITING FASTER – FILE FORMATS
MOST STABLE PERFORMANCE ON UNCOMPRESSED ORC
• Spark apps write raw data in ORC
• Presto reads ORC and writes aggregations in ORC
• replication uses ORC to send deltas to Vertica
BEST PERFORMANCE WITH HDFS BLOCK SIZE AND ORC STRIPE OF 64 MB
• Thanks to a strict 6-hour retention policy
ENABLING hive.orc.use-column-names=true
• simplifies the Spark app, allowing it to write the dataframe as is; Presto accesses columns by name
• allows the dataframe and database schemas to evolve/be modified independently (writer sketch below)
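For illustration, the same writer with the ORC choices above; df and batchId are as in the previous sketch, and the stripe-size option is assumed to be passed through to Spark's native ORC data source.

```scala
df.write
  .option("compression", "none")                            // uncompressed ORC was the most stable for us
  .option("orc.stripe.size", (64 * 1024 * 1024).toString)   // 64 MB stripes, aligned with the HDFS block size
  .mode("append")
  .orc(s"hdfs:///data/raw_facts/batch_id=$batchId")

// On the Presto side, hive.orc.use-column-names=true is a catalog property (hive.properties),
// so Presto matches columns by name rather than by position.
```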
SPARK PERFORMANCE
ONE EXECUTOR PER YARN NODE
• for better CPU and cache utilization, using 16 vcores (aligned to m4.4xl)
ALIGN RDD PARTITIONS TO VCORES
• Repartition the data we read from Kafka [addresses skew in Kafka partitions]
SPLIT THE PROCESSING BATCH INTERVAL INTO RESPONSIBILITY ZONES
• Control each interval separately (configuration sketch below)
1-minute batch: FETCH FROM KAFKA 8 seconds, ENRICHMENT 20 seconds, WRITE TO HIVE 20 seconds, STUFF/OVERHEAD 12 seconds
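A hedged configuration sketch of the layout above; the node count is invented, only the 16 vcores per m4.4xl executor and the "align partitions to vcores" rule come from the slide.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD

val nodes  = 40                                    // hypothetical number of task nodes
val vcores = 16                                    // m4.4xlarge

val conf = new SparkConf()
  .set("spark.executor.cores", vcores.toString)    // one fat executor per YARN node
  .set("spark.executor.instances", nodes.toString)

// Repartition right after the Kafka fetch so that skewed or too-few Kafka partitions
// do not leave vcores idle during enrichment and the Hive write.
def alignToVcores[T](rdd: RDD[T]): RDD[T] = rdd.repartition(nodes * vcores)
```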
FRIENDLY REMINDER
[Diagram: NGINX → KAFKA → SPARK → MEMSQL → REPORTING SERVICE]
INTRODUCING PRESTO
[Diagram: NGINX → KAFKA → SPARK → PRESTO → VERTICA → REPORTING SERVICE; data lag ~3m]
UNDER THE HOOD
Aggregations and replications run every minute (an aggregation sketch follows below)
Presto uses dimensions hosted outside: MemSQL with realtime updates
[Diagram: Spark on EMR writes to a collocated HDFS/Presto cluster; replicators, driven by a Jenkins scheduler, copy data to a Vertica node serving the reporting service; MemSQL holds the dimensions]
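Purely illustrative: the per-minute aggregation expressed as a Presto query over JDBC, the kind of statement the Jenkins-scheduled job could fire. Catalog, schema and table names are assumptions; the dimensions are assumed to be exposed to Presto from MemSQL through its MySQL-protocol connector.

```scala
import java.sql.DriverManager

val batchId = 1528300800000L   // example minute-aligned batch_id

// Presto JDBC URL points at the hive catalog; "memsql" is an assumed catalog name
// configured on the Presto side for the MySQL-compatible MemSQL endpoint.
val conn = DriverManager.getConnection("jdbc:presto://presto-coordinator:8080/hive/default", "etl", null)
val stmt = conn.createStatement()
stmt.execute(
  s"""INSERT INTO hive.default.aggregated_facts
     |SELECT f.batch_id, f.campaign_id, d.campaign_name, count(*) AS events
     |FROM hive.default.raw_facts f
     |JOIN memsql.dims.campaigns d ON d.id = f.campaign_id
     |WHERE f.batch_id = $batchId
     |GROUP BY f.batch_id, f.campaign_id, d.campaign_name""".stripMargin)
conn.close()
```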
FAULT TOLERANCE
EMR FLEETS MODEL
• New feature
• Allows focusing on cores instead of machines
• Allows provisioning nodes across multiple AZs
SPARK SPECULATION & BLACKLISTING
• Faulty nodes are a total disaster (c)
• Spark feature request to introduce a minimal speculation interval (conflicts with DirectCommitter); config sketch below
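The two mitigations map onto standard Spark settings; the thresholds below are illustrative, and the slide's caveat applies: speculation can duplicate writes when combined with a direct output committer.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")                          // re-launch suspiciously slow tasks
  .set("spark.speculation.multiplier", "3")                  // how much slower than the median before speculating
  .set("spark.speculation.quantile", "0.9")                  // fraction of tasks that must finish first
  .set("spark.blacklist.enabled", "true")                    // stop scheduling on repeatedly failing executors/nodes
  .set("spark.blacklist.task.maxTaskAttemptsPerNode", "2")
```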
FAULT TOLERANCE
EVENT/BATCH SOURCING
• Spark associates each micro-batch with a batch_id [timestamp]
• batch_id is a partitioning Hive column
• Aggregate and replicate only the missed batches
• In case of failure, after a restart every component auto-recovers without data loss
[Diagram, built up across several slides: Spark lands BATCH 1, 2, 3, ... into the raw facts table in HDFS/Hive; Presto aggregates them batch by batch into the aggregated facts table in HDFS/Hive; the replicator copies each batch into the aggregated facts table in Vertica. A component that falls behind simply processes the batches it has missed. See the sketch below.]
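A minimal sketch of the catch-up logic, assuming both tables are partitioned by batch_id and that partition values look like batch_id=&lt;timestamp&gt;; the table names and the aggregateBatch stub are illustrative.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Which batch_ids already exist on each side?
def batches(table: String): Set[Long] =
  spark.sql(s"SHOW PARTITIONS $table").collect()
    .map(_.getString(0).stripPrefix("batch_id=").toLong).toSet

val missed = (batches("raw_facts") -- batches("aggregated_facts")).toSeq.sorted

// Stand-in for the per-batch aggregation (e.g. the Presto INSERT shown earlier).
def aggregateBatch(batchId: Long): Unit = println(s"aggregating batch $batchId")

// After a restart, only the gaps are processed, so the component simply catches up.
missed.foreach(aggregateBatch)
```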
BACKPRESSURE ENABLED
[Diagram: NGINX → KAFKA → SPARK → PRESTO → VERTICA → REPORTING SERVICE]
SPARK STREAMING BACKPRESSURE
• MUST HAVE for a variable rate
• FEATURE contributed to Spark master: backpressure initial max rate for direct mode (see the sketch below)
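The settings behind this, as a sketch; the rates are placeholders. spark.streaming.backpressure.initialRate is the knob that caps the very first batches of a direct Kafka stream before the rate estimator has any history.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.backpressure.initialRate", "100000")   // cap the first batches (direct mode)
  .set("spark.streaming.kafka.maxRatePerPartition", "20000")   // hard per-partition ceiling
```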
BACKPRESSURE ENABLED
HDFS VALVE
• HDFS sits between Spark and Presto
• Retention policy: 12h
BACKPRESSURE ENABLED
PULL WRITE
• Use Vertica's COPY from HDFS so that Vertica reads data at its own rate (sketch below)
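Rough shape of the pull-style replication step (table, path and credentials are made up): the replicator only issues a COPY statement over JDBC, and Vertica itself pulls the ORC files from HDFS at its own pace.

```scala
import java.sql.DriverManager

val batchId = 1528300800000L   // example batch_id to replicate

val conn = DriverManager.getConnection("jdbc:vertica://vertica-host:5433/dwh", "replicator", "secret")
val stmt = conn.createStatement()
stmt.execute(
  s"""COPY aggregated_facts
     |FROM 'hdfs:///data/aggregated_facts/batch_id=$batchId/*'
     |ORC
     |DIRECT""".stripMargin)
conn.close()
```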
BACKPRESSURE ENABLED
KAFKA OUTAGES
• Lua writes events directly to Kafka
• Unsent events are stored locally and sent to S3
• NiFi periodically resends that data back to Kafka
I AM A GOD
I HAVE NO IDEA
WHAT’S GOING ON
MONITORING FUNDAMENTALS
FUNDAMENTAL REALTIME METRICS
• IN RATE
• OUT RATE
• CURRENT LAG
• ERROR RATE
• BATCH PROCESSING TIME
• PIPELINE LATENCY
SEPARATE APP INTRODUCED [aka BANDARLOG]
• Tracks offsets for Kafka, Hive/Presto and Vertica (lag sketch below)
• Standalone application
• To be open sourced soon
USING DATADOG
• Dashboards, monitors
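Bandarlog itself is a separate application, but the Kafka side of CURRENT LAG boils down to something like the sketch below: compare the consumer group's committed offsets with the topic end offsets and report the difference. Broker address and group id are assumptions; in production the number is pushed to Datadog.

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.admin.AdminClient
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.ByteArrayDeserializer

val props = new Properties()
props.put("bootstrap.servers", "kafka:9092")

// Committed offsets of the ingestion app's consumer group.
val admin = AdminClient.create(props)
val committed = admin.listConsumerGroupOffsets("ingestion-app")
  .partitionsToOffsetAndMetadata().get().asScala

// End offsets of the same partitions.
props.put("group.id", "lag-checker")
props.put("key.deserializer", classOf[ByteArrayDeserializer].getName)
props.put("value.deserializer", classOf[ByteArrayDeserializer].getName)
val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
val ends = consumer.endOffsets(committed.keySet.asJava).asScala

val lag = committed.map { case (tp, om) => ends(tp) - om.offset() }.sum
println(s"current lag: $lag")   // reported to Datadog as CURRENT LAG in the real setup

consumer.close()
admin.close()
```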
DASHBOARD EXAMPLE [INGESTION]
DASHBOARD EXAMPLE [AGGREGATIONS]
WHAT WE HAVE ACHIEVED
SCALABLE PRODUCTION
• Ability to grow further beyond 1M/s, up to 2.5M/s
STABLE PRODUCTION ENVIRONMENT
• fault-tolerant components, easier to recover
LESS EXPENSIVE
• Smaller Spark cluster (-50%)
• Presto cluster is ~30% smaller than the MemSQL-driven one
SIMPLIFIED MAINTENANCE
• Auto recovery and scaling
• No wake-ups over night
THANK YOU