Inneractive - Spark meetup2

Richard Grossman | System Architect
Processing Billions of Daily Events

What we do…
[Diagram: advertisers and networks (RTB, RIR, SAPI, RAPI, Video) connected through the platform; labels: $$$, 2M/min, 250ms]

Numbers…
>Incoming requests ==> 1.5 to 2 M / Minute
>Events generated ==> 20 to 30 M / Minute
>Generating 5+ TB / day of raw data (CSV + Parquet)
>Storing 550 days of aggregated data
>Storing years of raw data
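As a quick check on the "billions of daily events" in the title, the per-minute rates above can be converted to daily totals. The sketch below only does that arithmetic; the object and method names are illustrative and not from the talk.

```scala
object DailyVolume {
  // Minutes in one day: 24 * 60 = 1,440.
  val MinutesPerDay: Long = 24L * 60L

  // Convert a per-minute rate into a per-day total.
  def perDay(perMinute: Long): Long = perMinute * MinutesPerDay

  def main(args: Array[String]): Unit = {
    // Incoming requests: 1.5M to 2M per minute.
    println(s"Requests/day: ${perDay(1500000L)} to ${perDay(2000000L)}")
    // Generated events: 20M to 30M per minute.
    println(s"Events/day:   ${perDay(20000000L)} to ${perDay(30000000L)}")
  }
}
```

At these rates the system sees roughly 2.2 to 2.9 billion requests and 28.8 to 43.2 billion events per day.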

Concerns…
>Company traffic up 200% from last year
>Writing directly to a relational DB is no longer an option
>Solution should support both hot and cold data
>Lambda architecture
>Cost effective
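The hot/cold split behind a Lambda architecture can be pictured as a query that combines a batch view (cold, complete up to the last batch run) with a speed view (hot, covering events since that run). The following is a minimal sketch of that general pattern, not the implementation described in this talk; the `View` type, the `query` function, and the metric names are all hypothetical.

```scala
object LambdaSketch {
  // A view maps a metric name to its aggregated count (hypothetical shape).
  type View = Map[String, Long]

  // Serving-layer query: add the batch (cold) total to the
  // speed (hot) total for the same key; a missing key counts as 0.
  def query(batchView: View, speedView: View, key: String): Long =
    batchView.getOrElse(key, 0L) + speedView.getOrElse(key, 0L)

  def main(args: Array[String]): Unit = {
    val batchView: View = Map("impressions" -> 1000L, "clicks" -> 40L)
    val speedView: View = Map("impressions" -> 25L)
    println(query(batchView, speedView, "impressions")) // prints 1025
    println(query(batchView, speedView, "clicks"))      // prints 40
  }
}
```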

Our Solution
>Streaming data with Kafka
>Handle real time data with Spark Streaming
>Handle raw data with Spark Jobs over Parquet DB
>Data Scientist friendly environment using DataBricks
>Super Cost Effective

Code Sample
>Define Streaming Context
implicit val ssc = new StreamingContext(sparkConfiguration, batchInterval)
val topicMap = Map("Topic" -> 5) // topic name -> number of consumer threads
>Define DStream on Kafka
val stream = FixedKafkaInputDStream[String, Event, KeyDecoder, ValueDecoder](ssc,
  KafkaParams, topicMap, StorageLevel.MEMORY_ONLY)
>Aggregate the data (in our case reduceByKey)
// Emit one ((gender, age), 1) pair per event; assumes Event exposes gender and age
val mapped = stream map { event => (event.gender, event.age) -> 1 }
val reduced = mapped.reduceByKey { _ + _ }
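To see what the map plus reduceByKey step computes for a single batch, here is a minimal local sketch using plain Scala collections instead of a DStream. The `Event` case class and its fields are assumptions for illustration (the slides do not show the real class), and `groupBy` plus `sum` stands in for Spark's `reduceByKey(_ + _)`.

```scala
object ReduceByKeySketch {
  // Hypothetical event shape; the real Event class is not shown in the slides.
  final case class Event(gender: String, age: Int)

  // Local stand-in for `mapped.reduceByKey(_ + _)`:
  // pair each event with 1, then sum the 1s per (gender, age) key.
  def countByGenderAge(events: Seq[Event]): Map[(String, Int), Int] =
    events
      .map(e => (e.gender, e.age) -> 1)
      .groupBy(_._1)
      .map { case (key, pairs) => key -> pairs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    val batch = Seq(Event("F", 25), Event("M", 31), Event("F", 25))
    // Two events share the key ("F", 25), one has the key ("M", 31).
    countByGenderAge(batch).foreach { case ((gender, age), count) =>
      println(s"gender=$gender age=$age count=$count")
    }
  }
}
```

In the streaming job the same reduction runs once per batch interval, so each batch yields one count per distinct key rather than one record per event.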

Code Sample
>Working now on the aggregated RDD: collect the records, then insert into MySQL
reduced foreachRDD { rdd =>
  rdd.collect() foreach { aggregatedRecords =>
    val key = aggregatedRecords._1   // the (gender, age) pair
    val count = aggregatedRecords._2 // number of events for that pair
    // SQL executed per record (pseudocode):
    // INSERT INTO MYTABLE VALUES(key.age, key.gender, count) ON
    // DUPLICATE KEY UPDATE ….
  }
}
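The insert-or-update step can be illustrated without a database. The sketch below uses an in-memory map as a stand-in for the MySQL table and assumes the elided UPDATE clause adds the incoming count to the stored one; the slide does not show that clause, so this is an assumption. All names here are hypothetical.

```scala
object UpsertSketch {
  type Key = (String, Int) // (gender, age)

  // In-memory stand-in for the table: insert a new key, or, on a
  // duplicate key, add the batch count to the stored count (assumed behavior).
  def upsert(table: Map[Key, Long], batch: Seq[(Key, Int)]): Map[Key, Long] =
    batch.foldLeft(table) { case (acc, (key, count)) =>
      acc.updated(key, acc.getOrElse(key, 0L) + count)
    }

  def main(args: Array[String]): Unit = {
    val empty: Map[Key, Long] = Map.empty
    // First batch inserts two new keys.
    val afterFirst = upsert(empty, Seq((("F", 25), 2), (("M", 31), 1)))
    // Second batch hits an existing key, so its count is added: 2 + 3 = 5.
    val afterSecond = upsert(afterFirst, Seq((("F", 25), 3)))
    println(afterSecond(("F", 25))) // prints 5
    println(afterSecond(("M", 31))) // prints 1
  }
}
```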

Architecture Part 2
>100 ~ 200 servers stream events to Kafka
>Spark Streaming cluster handles events in real time (~30M/Min)
>Updating MySQL at a rate of 1,500 updates/second
>Generating Parquet files at ~1 GB/hour
>Parquet DB accessible using “DataBricks” cluster for ad hoc queries
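The figures above imply how much the streaming aggregation reduces the write load on MySQL. The following sketch only works through that arithmetic; the object and method names are illustrative.

```scala
object ThroughputSketch {
  // Convert a per-minute rate to a per-second rate.
  def perSecond(perMinute: Long): Long = perMinute / 60L

  // How many incoming events are folded into each MySQL update (integer division).
  def eventsPerUpdate(eventsPerMinute: Long, updatesPerSecond: Long): Long =
    perSecond(eventsPerMinute) / updatesPerSecond

  def main(args: Array[String]): Unit = {
    // ~30M events/min is 500,000 events/second.
    println(perSecond(30000000L))              // prints 500000
    // At 1,500 updates/second, each update absorbs about 333 events.
    println(eventsPerUpdate(30000000L, 1500L)) // prints 333
  }
}
```

In other words, aggregating before writing cuts the database write rate by a factor of roughly 300 compared with one write per event.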

Infrastructure
>Running on Amazon EC2
>Kafka cluster (4 Brokers, 3 Zookeepers)
>Spark Streaming cluster (1 Master, 5 Slaves)
>“DataBricks” clusters (On Demand & Spot Instance)
>Storage on Amazon S3 & Glacier
