ULTIMATE JOURNEY TOWARDS REALTIME DATA PLATFORM: 2.5M/s
BORIS TROFIMOV @ SIGMA SOFTWARE
ABOUT ME
Leading DWH @ Oath
Major expertise: Big Data and Enterprise
Co-founder of Odessa JUG
Passionate follower of Scala
Associate professor at ONPU
WHERE IS BIG DATA?
INTRODUCING DATA PLATFORM
[Diagram: integration points (API, back office, customer, web portal, mobile apps)]
INTRODUCING DATA PLATFORM
[Diagram: core domain services (domain-service microservices) behind the integration points: API, back office, customer, web portal, mobile apps]
INTRODUCING DATA PLATFORM
[Diagram: infrastructure services (service discovery, shared config, domain dependency management, ACL management) added next to the core domain services and integration points]
INTRODUCING DATA PLATFORM
[Same diagram, annotated with the question: where does BIG DATA fit?]
INTRODUCING DATA PLATFORM
[Diagram: a DATA PLATFORM block is added alongside the core domain services and infrastructure services]
INTRODUCING DATA PLATFORM
[Diagram: the data platform receives events from platform components and 3rd-party providers and serves reporting and analytics. Major mission: organize data]
ZOOMING IN DATA PLATFORM
[Diagram: an ingestion module, validation/enrichment module and aggregations module feed a warehouse of raw data, facts and dimensions; a reporting service, analytics module, configuration module and dimension updater sit around it]
based on © https://blogs.msdn.microsoft.com/agile/2012/07/26/cqrs-journey-guidance-project-released/
OUR DOMAIN
[Diagram: core platform and data platform serving video players, content owners and end users]
OUR DOMAIN: DATA LAKE PROCESSING
S3 Data Lake: 5 PB
Vertica: 500 TB
Rows/Table: 600 B
Events/Sec: 2.5 M
Files/Hour/Pipeline: 15 K
Data/Day: 25 TB
ORIGINAL PIPELINE
[Diagram: NGINX → S3 → HADOOP → VERTICA → REPORTING SERVICE; data lag ~1h]
UNITED LAMBDA PLATFORM
[Diagram: batch path NGINX → S3 → HADOOP → VERTICA (data lag ~1h) and speed path NGINX → KAFKA → SPARK → MEMSQL (data lag ~2m), both feeding the REPORTING SERVICE]
WHAT WAS GOOD
DATA DELIVERY TIME: 2 min
FINE AT PROD SCALE AT THAT TIME (150K/s)
PAINFUL TO SCALE UP TO 1M/s
ONE DOES NOT SIMPLY
GROW 20X
WHY WE NEEDED CHANGES
ROCKY SCALING
Adding/removing CDH YARN nodes requires a YARN restart and downtime for apps
Tricky to build quick sandboxes
The latest MemSQL release (5.x) was not able to operate a cluster with > 80 nodes
Max supported rate was 1M events/s, while the business required 2.5M/s
ZERO TOLERANCE
Faulty EC2 nodes could make Spark or MemSQL get stuck for a while
Buggy HA: even one faulty node could break the entire MemSQL cluster, forcing us to recreate the database and lose data
PUSH approach to write data to MemSQL
MONITORING & ALERTING
Find the most relevant metrics
Eliminate FALSE POSITIVE and FALSE NEGATIVE errors
ON THE WAY TO 2.5 M/s
MIGRATING SPARK TO EMR
EASY TO CREATE, EASY TO DESTROY
• easy to … make the bill cost a fortune
MULTIPLE EMR CLUSTERS
• Separation of concerns and isolation
• Better to run a single application per EMR cluster
• Simplified auto-scaling rules
STATELESS EMR CLUSTERS
• Do not use local HDFS
CAUTION, EMR!
EASY TO ALLOCATE AND EASY TO LOSE AN EMR NODE
• Mostly concerns m4.4xl, the most popular instance type
LOSING THE MASTER NODE MEANS LOSING THE ENTIRE CLUSTER
• Hard to build a reliable platform spanning multiple AZs [see the fleets model]
• Develop a one-step evacuation procedure to another EMR cluster
LACK OF CAPACITY FOR A SPECIFIC INSTANCE TYPE
• Can be mitigated by the fleets model
DEPLOYMENT DETAILS
[Diagram, built up across several slides: an EMR cluster [YARN] with a MASTER node and TASK nodes; the YARN config is zipped and published to S3; the driver app runs in a Docker container (orchestrated by Rancher) together with the Spark binaries and a local copy of the YARN config, and submits work to the cluster. A submission sketch follows below.]
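The deployment flow above can be sketched with Spark's launcher API. This is a hypothetical sketch rather than the exact production setup: paths, the driver class and the S3 bucket are made up, and it assumes the EMR YARN config zip has already been fetched from S3 and unpacked so that HADOOP_CONF_DIR points at it inside the container.

```scala
// Hypothetical sketch: submitting the driver app to the remote EMR's YARN from inside
// a Docker container. Spark binaries are baked into the image; the local YARN config
// (downloaded from S3 and unpacked, e.g. to /etc/hadoop/conf) tells the launcher where
// the cluster is via HADOOP_CONF_DIR / YARN_CONF_DIR.
import org.apache.spark.launcher.SparkLauncher

object SubmitToEmr extends App {
  val handle = new SparkLauncher()
    .setSparkHome("/opt/spark")                                        // Spark binaries in the image
    .setAppResource("s3://example-bucket/jars/ingestion-assembly.jar") // made-up artifact location
    .setMainClass("com.example.ingestion.Main")                        // hypothetical driver class
    .setMaster("yarn")
    .setDeployMode("cluster")
    .setConf("spark.executor.instances", "20")
    .startApplication()

  // Block until YARN reports a final state.
  while (!handle.getState.isFinal) Thread.sleep(5000)
}
```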
CDH vs EMR
CDH: Cannot scale out/in on demand | EMR: Able to scale out/in on demand
CDH: No extra cost (for a community license) | EMR: Extra ~30% on top of EC2 costs, per-second billing (!)
CDH: Adding machines requires restarting YARN | EMR: No YARN restart
CDH: Easy configuration management via CM | EMR: Limited configuration available during EMR creation
CDH: Classic YARN cluster | EMR: Ordinary YARN under the hood, imposes an EMR-driven way to deploy apps
CDH: Single CDH per AZ | EMR: EMR cluster on demand as the unit of clustering
MAKING SPARK WRITE FASTER
USING A CUSTOM HADOOP COMMITTER
• FileOutputCommitter with the V2 algorithm to avoid the extra file-move phase in HDFS/S3
WRITE THE DATAFRAME TO HDFS FIRST
• Spark writes to HDFS directly into the partitioned folder and registers the new partition in Hive (see the sketch below)
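A minimal sketch of both points, with made-up table and path names: the v2 commit algorithm removes the extra rename/move phase, and the job writes straight into a batch_id partition folder before registering that partition in Hive.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ingestion")
  // v2 algorithm: task output is committed in place, no second "move files" phase
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// Stand-in for the enriched events dataframe built upstream.
val df = Seq((123L, "impression"), (456L, "click")).toDF("user_id", "event_type")

val batchId = System.currentTimeMillis() / 60000 * 60000   // minute-aligned batch_id
val path    = s"hdfs:///data/raw_facts/batch_id=$batchId"  // hypothetical layout

df.write.mode("overwrite").orc(path)                        // write directly into the partition folder

// Register the partition so Hive/Presto see the new data immediately.
spark.sql(s"ALTER TABLE raw_facts ADD IF NOT EXISTS PARTITION (batch_id=$batchId) LOCATION '$path'")
```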
WRITING FASTER – FILE FORMATS
MOST STABLE PERFORMANCE ON UNCOMPRESSED ORC
• Spark apps write raw data in ORC
• Presto reads ORC and writes aggregations in ORC
• replication uses ORC to send deltas to Vertica
BEST PERFORMANCE WITH HDFS BLOCK SIZE AND ORC STRIPE OF 64 MB
• Thanks to a strict 6-hour retention policy
ENABLING hive.orc.use-column-names=true
• simplifies the Spark app, allowing it to write the dataframe as is; Presto accesses columns by name
• allows the dataframe and database schemas to evolve/be modified independently (writer sketch below)
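For illustration, the same writer with the ORC choices above; df and batchId are as in the previous sketch, and the stripe-size option is assumed to be passed through to Spark's native ORC data source.

```scala
df.write
  .option("compression", "none")                            // uncompressed ORC was the most stable for us
  .option("orc.stripe.size", (64 * 1024 * 1024).toString)   // 64 MB stripes, aligned with the HDFS block size
  .mode("append")
  .orc(s"hdfs:///data/raw_facts/batch_id=$batchId")

// On the Presto side, hive.orc.use-column-names=true is a catalog property (hive.properties),
// so Presto matches columns by name rather than by position.
```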
SPARK PERFORMANCE
ONE EXECUTOR PER YARN NODE
• for better CPU and cache utilization, using 16 vcores (aligned to m4.4xl)
ALIGN RDD PARTITIONS TO VCORES
• Repartition the data we read from Kafka [addresses skew in Kafka partitions]
SPLIT THE PROCESSING BATCH INTERVAL INTO RESPONSIBILITY ZONES
• Control each interval separately (configuration sketch below)
1-minute batch: FETCH FROM KAFKA 8 seconds, ENRICHMENT 20 seconds, WRITE TO HIVE 20 seconds, STUFF/OVERHEAD 12 seconds
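A hedged configuration sketch of the layout above; the node count is invented, only the 16 vcores per m4.4xl executor and the "align partitions to vcores" rule come from the slide.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD

val nodes  = 40                                    // hypothetical number of task nodes
val vcores = 16                                    // m4.4xlarge

val conf = new SparkConf()
  .set("spark.executor.cores", vcores.toString)    // one fat executor per YARN node
  .set("spark.executor.instances", nodes.toString)

// Repartition right after the Kafka fetch so that skewed or too-few Kafka partitions
// do not leave vcores idle during enrichment and the Hive write.
def alignToVcores[T](rdd: RDD[T]): RDD[T] = rdd.repartition(nodes * vcores)
```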
FRIENDLY REMINDER
[Diagram: NGINX → KAFKA → SPARK → MEMSQL → REPORTING SERVICE]
INTRODUCING PRESTO
[Diagram: NGINX → KAFKA → SPARK → PRESTO → VERTICA → REPORTING SERVICE; data lag ~3m]
UNDER THE HOOD
Aggregations and replications run every minute (an aggregation sketch follows below)
Presto uses dimensions hosted outside: MemSQL with realtime updates
[Diagram: Spark on EMR writes to a collocated HDFS/Presto cluster; replicators, driven by a Jenkins scheduler, copy data to a Vertica node serving the reporting service; MemSQL holds the dimensions]
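Purely illustrative: the per-minute aggregation expressed as a Presto query over JDBC, the kind of statement the Jenkins-scheduled job could fire. Catalog, schema and table names are assumptions; the dimensions are assumed to be exposed to Presto from MemSQL through its MySQL-protocol connector.

```scala
import java.sql.DriverManager

val batchId = 1528300800000L   // example minute-aligned batch_id

// Presto JDBC URL points at the hive catalog; "memsql" is an assumed catalog name
// configured on the Presto side for the MySQL-compatible MemSQL endpoint.
val conn = DriverManager.getConnection("jdbc:presto://presto-coordinator:8080/hive/default", "etl", null)
val stmt = conn.createStatement()
stmt.execute(
  s"""INSERT INTO hive.default.aggregated_facts
     |SELECT f.batch_id, f.campaign_id, d.campaign_name, count(*) AS events
     |FROM hive.default.raw_facts f
     |JOIN memsql.dims.campaigns d ON d.id = f.campaign_id
     |WHERE f.batch_id = $batchId
     |GROUP BY f.batch_id, f.campaign_id, d.campaign_name""".stripMargin)
conn.close()
```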
FAULT TOLERANCE
EMR FLEETS MODEL
• New feature
• Allows focusing on cores instead of machines
• Allows provisioning nodes across multiple AZs
SPARK SPECULATION & BLACKLISTING
• Faulty nodes are a total disaster (c)
• Spark feature request to introduce a minimal speculation interval (conflicts with DirectCommitter); config sketch below
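The two mitigations map onto standard Spark settings; the thresholds below are illustrative, and the slide's caveat applies: speculation can duplicate writes when combined with a direct output committer.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")                          // re-launch suspiciously slow tasks
  .set("spark.speculation.multiplier", "3")                  // how much slower than the median before speculating
  .set("spark.speculation.quantile", "0.9")                  // fraction of tasks that must finish first
  .set("spark.blacklist.enabled", "true")                    // stop scheduling on repeatedly failing executors/nodes
  .set("spark.blacklist.task.maxTaskAttemptsPerNode", "2")
```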
FAULT TOLERANCE
EVENT/BATCH SOURCING
• Spark associates each micro-batch with a batch_id [timestamp]
• batch_id is a partitioning Hive column
• Aggregate and replicate only the missed batches
• In case of failure, after a restart every component auto-recovers without data loss
[Diagram, built up across several slides: Spark lands BATCH 1, 2, 3, ... into the raw facts table in HDFS/Hive; Presto aggregates them batch by batch into the aggregated facts table in HDFS/Hive; the replicator copies each batch into the aggregated facts table in Vertica. A component that falls behind simply processes the batches it has missed. See the sketch below.]
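A minimal sketch of the catch-up logic, assuming both tables are partitioned by batch_id and that partition values look like batch_id=&lt;timestamp&gt;; the table names and the aggregateBatch stub are illustrative.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Which batch_ids already exist on each side?
def batches(table: String): Set[Long] =
  spark.sql(s"SHOW PARTITIONS $table").collect()
    .map(_.getString(0).stripPrefix("batch_id=").toLong).toSet

val missed = (batches("raw_facts") -- batches("aggregated_facts")).toSeq.sorted

// Stand-in for the per-batch aggregation (e.g. the Presto INSERT shown earlier).
def aggregateBatch(batchId: Long): Unit = println(s"aggregating batch $batchId")

// After a restart, only the gaps are processed, so the component simply catches up.
missed.foreach(aggregateBatch)
```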
BACKPRESSURE ENABLED
[Diagram: NGINX → KAFKA → SPARK → PRESTO → VERTICA → REPORTING SERVICE]
SPARK STREAMING BACKPRESSURE
• MUST HAVE for a variable rate
• FEATURE contributed to Spark master: backpressure initial max rate for direct mode (see the sketch below)
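The settings behind this, as a sketch; the rates are placeholders. spark.streaming.backpressure.initialRate is the knob that caps the very first batches of a direct Kafka stream before the rate estimator has any history.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.backpressure.initialRate", "100000")   // cap the first batches (direct mode)
  .set("spark.streaming.kafka.maxRatePerPartition", "20000")   // hard per-partition ceiling
```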
BACKPRESSURE ENABLED
HDFS VALVE
• HDFS sits between Spark and Presto
• Retention policy: 12h
BACKPRESSURE ENABLED
PULL WRITE
• Use Vertica's COPY from HDFS so that Vertica reads data at its own rate (sketch below)
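Rough shape of the pull-style replication step (table, path and credentials are made up): the replicator only issues a COPY statement over JDBC, and Vertica itself pulls the ORC files from HDFS at its own pace.

```scala
import java.sql.DriverManager

val batchId = 1528300800000L   // example batch_id to replicate

val conn = DriverManager.getConnection("jdbc:vertica://vertica-host:5433/dwh", "replicator", "secret")
val stmt = conn.createStatement()
stmt.execute(
  s"""COPY aggregated_facts
     |FROM 'hdfs:///data/aggregated_facts/batch_id=$batchId/*'
     |ORC
     |DIRECT""".stripMargin)
conn.close()
```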
BACKPRESSURE ENABLED
KAFKA OUTAGES
• Lua writes events directly to Kafka
• Unsent events are stored locally and sent to S3
• NiFi periodically resends that data back to Kafka
I AM A GOD
I HAVE NO IDEA
WHAT’S GOING ON
MONITORING FUNDAMENTALS
FUNDAMENTAL REALTIME METRICS
• IN RATE
• OUT RATE
• CURRENT LAG
• ERROR RATE
• BATCH PROCESSING TIME
• PIPELINE LATENCY
SEPARATE APP INTRODUCED [aka BANDARLOG]
• Tracks offsets for Kafka, Hive/Presto and Vertica (lag sketch below)
• Standalone application
• To be open sourced soon
USING DATADOG
• Dashboards, monitors
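Bandarlog itself is a separate application, but the Kafka side of CURRENT LAG boils down to something like the sketch below: compare the consumer group's committed offsets with the topic end offsets and report the difference. Broker address and group id are assumptions; in production the number is pushed to Datadog.

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.admin.AdminClient
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.ByteArrayDeserializer

val props = new Properties()
props.put("bootstrap.servers", "kafka:9092")

// Committed offsets of the ingestion app's consumer group.
val admin = AdminClient.create(props)
val committed = admin.listConsumerGroupOffsets("ingestion-app")
  .partitionsToOffsetAndMetadata().get().asScala

// End offsets of the same partitions.
props.put("group.id", "lag-checker")
props.put("key.deserializer", classOf[ByteArrayDeserializer].getName)
props.put("value.deserializer", classOf[ByteArrayDeserializer].getName)
val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
val ends = consumer.endOffsets(committed.keySet.asJava).asScala

val lag = committed.map { case (tp, om) => ends(tp) - om.offset() }.sum
println(s"current lag: $lag")   // reported to Datadog as CURRENT LAG in the real setup

consumer.close()
admin.close()
```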
DASHBOARD EXAMPLE [INGESTION]
DASHBOARD EXAMPLE [AGGREGATIONS]
WHAT WE HAVE ACHIEVED
SCALABLE PRODUCTION
• Ability to grow further beyond 1M/s, up to 2.5M/s
STABLE PRODUCTION ENVIRONMENT
• fault-tolerant components, easier to recover
LESS EXPENSIVE
• Smaller Spark cluster (-50%)
• Presto cluster is ~30% smaller than the MemSQL-driven one
SIMPLIFIED MAINTENANCE
• Auto recovery and scaling
• No wake-ups over night
THANK YOU