Richard Grossman | System Architect
Processing Billions
of Daily Events
What we do…
[Diagram: Advertisers, $$$, RTB Networks, RIR Networks, SAPI Networks, RAPI Networks, Video Networks; 2M/min, 250ms]
Numbers…
>Incoming requests ==> 1.5 to 2 M / minute
>Events generated ==> 20 to 30 M / minute
>Generating 5+ TB / day of raw data (CSV + Parquet)
>Storing 550 days of aggregated data
>Storing years of raw data
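To put the per-minute figures in per-second and per-day terms, here is a small arithmetic sketch. It uses the upper bound of each range quoted above; the object and value names are made up for illustration.

```scala
object TrafficNumbers {
  // Upper bounds of the ranges quoted on the slide.
  val requestsPerMinute: Long = 2000000L
  val eventsPerMinute: Long = 30000000L

  // Integer division: fractional parts are dropped.
  val requestsPerSecond: Long = requestsPerMinute / 60
  val eventsPerSecond: Long = eventsPerMinute / 60
  val eventsPerRequest: Long = eventsPerMinute / requestsPerMinute

  // Only reached if the peak rate were sustained for a full day.
  val eventsPerDay: Long = eventsPerMinute * 60 * 24

  def main(args: Array[String]): Unit = {
    println(s"requests/second: $requestsPerSecond")
    println(s"events/second:   $eventsPerSecond")
    println(s"events/request:  $eventsPerRequest")
    println(s"events/day:      $eventsPerDay")
  }
}
```

At the lower bounds (1.5 M requests and 20 M events per minute) the same arithmetic gives 28.8 billion events per day, so "billions of daily events" holds across the whole range.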
The Past
Concerns…
>Company traffic increased 200% over last year
>Writing directly to a relational DB is no longer an option
>Solution should support both hot and cold data
>Lambda architecture
>Cost effective
Our Solution
>Streaming data with Kafka
>Handle real time data with Spark Streaming
>Handle raw data with Spark Jobs over Parquet DB
>Data Scientist friendly environment using DataBricks
>Super Cost Effective
Architecture
DStream (Discretized Stream)
Code Sample
>Define Streaming Context
implicit val ssc = new StreamingContext(sparkConfiguration, batchInterval)
val topicMap = Map("Topic" -> 5)
>Define DStream on Kafka
val stream = FixedKafkaInputDStream[String, Event, KeyDecoder, ValueDecoder](ssc,
  KafkaParams, topicMap, StorageLevel.MEMORY_ONLY)
>Aggregate the Data (in our case reduceByKey)
// One (key, 1) pair per event, so map rather than flatMap
val mapped = stream map { event => (event.gender, event.age) -> 1 }
// Sum the 1s per (gender, age) key within each batch
val reduced = mapped.reduceByKey { _ + _ }
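The aggregation step can be tried without a cluster. The sketch below applies the same map and reduce-by-key logic to one in-memory batch using plain Scala 2.13 collections. It is not the Spark code from the talk: the `Event` fields and the `BatchAggregation` object are assumptions, since the slides do not show the real `Event` class.

```scala
// Hypothetical event shape; the real Event class is not shown in the slides.
final case class Event(gender: String, age: Int)

object BatchAggregation {
  // Map every event to ((gender, age), 1), then sum the 1s per key.
  // groupMapReduce plays the role that reduceByKey plays on a DStream.
  def aggregate(batch: Seq[Event]): Map[(String, Int), Int] =
    batch
      .map(event => (event.gender, event.age) -> 1)
      .groupMapReduce(_._1)(_._2)(_ + _)

  def main(args: Array[String]): Unit = {
    val batch = Seq(Event("F", 25), Event("M", 30), Event("F", 25))
    // Two events share the key ("F", 25), so that key ends up with count 2.
    println(aggregate(batch))
  }
}
```

The output has one record per distinct key rather than one per event, which is what keeps the downstream database write rate low.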
Code Sample
>Working now on the aggregated RDD: collect the records, then insert into MySQL
reduced foreachRDD { rdd =>
  // collect() brings the (small) aggregated result to the driver
  rdd.collect() foreach { case ((gender, age), count) =>
    // SQL executed for each aggregated record:
    // INSERT INTO MYTABLE VALUES(age, gender, count) ON
    // DUPLICATE KEY UPDATE ….
  }
}
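The effect of writing successive batches with an upsert can be sketched in memory. The example assumes the elided UPDATE clause adds each new count to the stored one; the slides do not state this. The mutable map stands in for the MySQL table, and every name in the block is hypothetical.

```scala
import scala.collection.mutable

object UpsertSketch {
  type Key = (String, Int) // (gender, age), as produced by the aggregation

  // Stand-in for the MySQL table: one row per key with a running count.
  val table: mutable.Map[Key, Long] = mutable.Map.empty

  // Assumed upsert semantics: insert a new row, or add to the existing count.
  def upsert(key: Key, count: Long): Unit =
    table.update(key, table.getOrElse(key, 0L) + count)

  // One call per streaming batch, one upsert per aggregated record.
  def writeBatch(aggregated: Map[Key, Int]): Unit =
    aggregated.foreach { case (key, count) => upsert(key, count.toLong) }

  def main(args: Array[String]): Unit = {
    writeBatch(Map(("F", 25) -> 2, ("M", 30) -> 1))
    writeBatch(Map(("F", 25) -> 1))
    println(table)
  }
}
```

Under that assumption, replaying a batch would double-count it, so a real pipeline would also need to decide how to handle retries.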
Architecture Part 2
>100 ~ 200 servers stream events to Kafka
>Spark Streaming cluster handles events in real time (~30M/min)
>Updating MySQL at a rate of 1,500 updates/second
>Generating Parquet files at ~1 GB/hour
>Parquet DB accessible using "DataBricks" cluster for ad hoc queries
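A rough way to read these figures together is to compare the incoming event rate with the MySQL update rate. The sketch below does only that arithmetic, using the approximate values from the bullets; the names are illustrative.

```scala
object ReductionRatio {
  // ~30M events per minute entering Spark Streaming.
  val eventsPerSecond: Long = 30000000L / 60

  // MySQL update rate quoted on the slide.
  val updatesPerSecond: Long = 1500L

  // Events folded into each database update on average (integer division).
  val eventsPerUpdate: Long = eventsPerSecond / updatesPerSecond

  def main(args: Array[String]): Unit =
    println(s"about $eventsPerUpdate events per MySQL update")
}
```

Each update therefore stands for a few hundred raw events on average, which is the gap that pre-aggregation in the streaming layer closes.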
Infrastructure
>Running on Amazon EC2
>Kafka cluster (4 Brokers, 3 Zookeepers)
>Spark Streaming cluster (1 Master, 5 Slaves)
>“DataBricks” clusters (On Demand & Spot Instance)
>Storage on Amazon S3 & Glacier
{Thanks}

Inneractive - Spark meetup2
