Building interactive audience analytics
with Spark
Eugene Zhulenev @ezhulenev
Problem definition
1 / 22
2 / 22
Impression log
Ad Site Cookie Impressions Clicks Segments
bmw_X5 forbes.com 13e835610ff0d95 10 1 [a.m, b.rk, c.rh, d.sn, ...]
mercedes_2015 forbes.com 13e8360c8e1233d 5 0 [a.f, b.rk, c.hs, d.mr, ...]
nokia gizmodo.com 13e3c97d526839c 8 0 [a.m, b.tk, c.hs, d.sn, ...]
apple_music reddit.com 1357a253f00c0ac 3 1 [a.m, b.rk, d.sn, e.gh, ...]
nokia cnn.com 13b23555294aced 2 1 [a.f, b.tk, c.rh, d.sn, ...]
apple_music facebook.com 13e8333d16d723d 9 1 [a.m, d.sn, g.gh, s.hr, ...]
Segment codes:
- a.m : Male
- a.f : Female
- b.tk : $75k-$100k annual income
- b.rk : $100k-$150k annual income
- c.hs : High School
- c.rh : College
- d.sn : Single
- d.mr : Married
3 / 22
What we want to know
What is the Male/Female ratio for people who have seen the 'bmw_X5' ad on
forbes.com?
Income distribution for people who have seen the Apple Music ad
Nokia click distribution across different education levels
4 / 22
SQL solution
select count(distinct cookie_id) from impressions
where site = 'forbes.com'
and ad = 'bmw_X5'
and segment contains 'a.m'
5 / 22
SQL solution
select count(distinct cookie_id) from impressions
where site = 'forbes.com'
and ad = 'bmw_X5'
and segment contains 'a.m'
Looks pretty simple
6 / 22
SQL solution
select count(distinct cookie_id) from impressions
where site = 'forbes.com'
and ad = 'bmw_X5'
and segment contains 'a.m'
Looks pretty simple
Unfortunately it doesn't work
7 / 22
SQL solution
select count(distinct cookie_id) from impressions
where site = 'forbes.com'
and ad = 'bmw_X5'
and segment contains 'a.m'
Looks pretty simple
Unfortunately it doesn't work
It takes minutes to run this type of query with Impala or Hive
It's impossible to pre-generate all reports: the number of filter combinations
is huge
We need instant responses for the reporting UI
8 / 22
HyperLogLog for cardinality estimation
9 / 22
HyperLogLog
trait HyperLogLog {
def add(cookieId: String): Unit
// |A|
def cardinality(): Long
// |A ∪ B|
def merge(other: HyperLogLog): HyperLogLog
// |A ∩ B| = |A| + |B| - |A ∪ B|
def intersect(other: HyperLogLog): Long
}
Algorithm for the count-distinct problem, approximating the number of
distinct elements (cardinality)
Uses fixed space (configurable precision)
Able to estimate cardinalities beyond 10^9 with a typical accuracy of 2%,
using 1.5 kB of memory
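For illustration, a minimal sketch of the same operations with Twitter's
Algebird HyperLogLog (Algebird is an assumption for this example; the trait
above is the talk's own wrapper, not Algebird's API):

import com.twitter.algebird.HyperLogLogMonoid

// 12 bits => 2^12 registers, standard error ≈ 1.04 / sqrt(4096) ≈ 1.6%
val monoid = new HyperLogLogMonoid(12)

val cookies = Seq("13e835610ff0d95", "13e8360c8e1233d", "13e835610ff0d95")
val sketch  = monoid.sum(cookies.map(c => monoid.create(c.getBytes("UTF-8"))))

// |A|: approximate number of distinct cookies (~2 here)
println(sketch.approximateSize.estimate)

// |A ∩ B| via inclusion-exclusion, as in the trait above:
// monoid.intersectionSize(Seq(sketchA, sketchB)).estimate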
10 / 22
From cookies to HyperLogLog
Ad Site Cookies HLL Impressions Clicks
bmw_X5 forbes.com HyperLogLog@23sdg4 5468 35
bmw_X5 cnn.com HyperLogLog@84jdg4 8943 29
Segment Cookies HLL Impressions Clicks
Male HyperLogLog@85sdg4 235468 335
$100k-$150k HyperLogLog@35jdg4 569473 194
Percent of college and high school education in BMW campaign
val adImpressions: Seq[Audience] = ... // Audience(ad, site, hll, imp, clk)
val segmentImpressions: Seq[Segment] = ... // Segment(name, hll, imp, clk)
val bmwCookies: HyperLogLog = adImpressions
  .filter(_.ad == "bmw_X5")
  .map(_.hll).reduce(_ merge _)
val educatedCookies: HyperLogLog = segmentImpressions
  .filter(s => Seq("College", "High School") contains s.name)
  .map(_.hll).reduce(_ merge _)
val p = (bmwCookies intersect educatedCookies).toDouble / bmwCookies.cardinality()
11 / 22
Spark DataFrames with HyperLogLog
12 / 22
Spark DataFrames
Inspired by R data.frame and Python/Pandas DataFrame
Distributed collection of rows organized into named columns
Formerly SchemaRDD (Spark < 1.3.0)
High-Level Operations
Selecting required columns
Filtering
Joining different data sets
Aggregation (count, sum, average, etc.)
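A short sketch of those operations on the impression log (hedged: Spark
1.3-era API, column names as in the schema on the next slide):

import org.apache.spark.sql.functions._

val impressions = sqlContext.parquetFile("/aa/audience")

impressions
  .select(col("ad"), col("site"), col("impressions"), col("clicks")) // select columns
  .filter(col("site") === "forbes.com")                              // filter rows
  .groupBy(col("ad"))                                                // aggregate
  .agg(col("ad"), sum(col("impressions")), sum(col("clicks")))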
13 / 22
Spark DataFrames
val adImpressions: DataFrame = sqlContext.parquetFile("/aa/audience")
adImpressions.printSchema()
// root
//  |-- ad: string (nullable = true)
//  |-- site: string (nullable = true)
//  |-- hll: binary (nullable = true)
//  |-- impressions: long (nullable = true)
//  |-- clicks: long (nullable = true)
val segmentImpressions: DataFrame = sqlContext.parquetFile("/aa/segments")
segmentImpressions.printSchema()
// root
//  |-- segment: string (nullable = true)
//  |-- hll: binary (nullable = true)
//  |-- impressions: long (nullable = true)
//  |-- clicks: long (nullable = true)
14 / 22
Spark DataFrames
Percent of college and high school education in BMW campaign
import org.apache.spark.sql.functions._
import org.apache.spark.sql.HLLFunctions._
val bmwCookies: HyperLogLog = adImpressions
  .filter(col("ad") === "bmw_X5")
  .select(mergeHLL(col("hll")))
  .first().getAs[HyperLogLog](0)
val educatedCookies: HyperLogLog = segmentImpressions
  .filter(col("segment") === "College" || col("segment") === "High School")
  .select(mergeHLL(col("hll")))
  .first().getAs[HyperLogLog](0)
val p = (bmwCookies intersect educatedCookies).toDouble / bmwCookies.cardinality()
15 / 22
Extending DataFrames DSL
looks like 'native' DataFrame code
works faster than RDD[Row] transformations
easy to manage mutable state inside partition/merge functions
case class MergeHLLPartition(child: Expression)
extends AggregateExpression with trees.UnaryNode[Expression] { ... }
case class MergeHLLMerge(child: Expression)
extends AggregateExpression with trees.UnaryNode[Expression] { ... }
case class MergeHLL(child: Expression)
extends PartialAggregate with trees.UnaryNode[Expression] {
override def asPartial: SplitEvaluation = {
val partial = Alias(MergeHLLPartition(child), "PartialMergeHLL")()
SplitEvaluation(
MergeHLLMerge(partial.toAttribute),
partial :: Nil
)
}
}
def mergeHLL(e: Column): Column = MergeHLL(e.expr)
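Once registered, the new aggregate composes with the rest of the DSL; a
hedged usage sketch (column names as in the schemas above):

// one merged cookie sketch per ad, computed with partial aggregation
val cookiesPerAd = adImpressions
  .groupBy(col("ad"))
  .agg(col("ad"), mergeHLL(col("hll")) as "cookies_hll")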
16 / 22
Extending DataFrames DSL
Building a complicated structure with the DSL
type SegmentName = String
val dailyEstimates: RDD[(SegmentName, Map[LocalDate, SegmentEstimate])] =
segments.groupBy(segment_name).agg(
segment_name,
mergeDailySegmentEstimates(
mkDailySegmentEstimate( // -- Map[LocalDate, SegmentEstimate]
dt,
mkSegmentEstimate( // -- SegmentEstimate(cookieHLL, clickHLL)
cookie_hll,
click_hll)
)
)
)
HyperLogLog is a Monoid
SegmentEstimate is a Monoid
Map[K, SegmentEstimate] is a Monoid
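A minimal sketch of why this composes (SegmentEstimate here is a hypothetical
shape, not the exact production class):

case class SegmentEstimate(cookieHLL: HyperLogLog, clickHLL: HyperLogLog) {
  // combine component-wise: merging two estimates merges both sketches
  def merge(other: SegmentEstimate): SegmentEstimate =
    SegmentEstimate(cookieHLL merge other.cookieHLL, clickHLL merge other.clickHLL)
}

// Map[K, SegmentEstimate] is a monoid too: merge values that share a key
def mergeMaps[K](a: Map[K, SegmentEstimate],
                 b: Map[K, SegmentEstimate]): Map[K, SegmentEstimate] =
  b.foldLeft(a) { case (acc, (k, v)) =>
    acc.updated(k, acc.get(k).map(_ merge v).getOrElse(v))
  }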
17 / 22
Extending DataFrames DSL
custom aggregation functions need to live in the org.apache.spark.sql package
no guarantee that they will work in the next Spark release
see org.apache.spark.sql.catalyst.expressions.Sum as an example
18 / 22
Spark as in-memory SQL database
restart overnight to load new data (data preprocessed with Hive)
cache all the data in memory
serve client requests during the business day
simple Spray HTTP/JSON API
40 Spark worker nodes in a YARN cluster
100+ gigabytes cached in memory
average response time ~2 seconds
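A hedged sketch of what the Spray layer can look like (route, port, and
response shape are illustrative, not the production API):

import akka.actor.ActorSystem
import spray.routing.SimpleRoutingApp

object ReportingApi extends App with SimpleRoutingApp {
  implicit val system = ActorSystem("reporting-api")

  startServer(interface = "0.0.0.0", port = 8080) {
    path("audience" / Segment) { ad =>   // e.g. GET /audience/bmw_X5
      get {
        complete {
          // run the cached DataFrame aggregation and render JSON
          s"""{"ad": "$ad", "uniqueCookies": 0}"""   // placeholder response
        }
      }
    }
  }
}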
19 / 22
Spark configuration
spark.scheduler.mode=FAIR
spark.yarn.executor.memoryOverhead=4000
spark.sql.autoBroadcastJoinThreshold=300000000 // ~300 MB
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.speculation=true
repartition datasets (4-6 partitions per core)
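The same settings expressed in code; the partition count is illustrative
(workers × cores × 4-6), not a recommendation:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.yarn.executor.memoryOverhead", "4000")
  .set("spark.sql.autoBroadcastJoinThreshold", "300000000") // ~300 MB
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.speculation", "true")

// aim for 4-6 partitions per core, e.g. 40 nodes * 8 cores * 5
val repartitioned = adImpressions.repartition(1600)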
20 / 22
Other options
Hive
Too slow for an interactive API backend
Impala
Requires writing aggregation functions in C++
Not sure that 1-2 second latency is easy to achieve
Druid
Requires managing a separate Druid cluster
We have a batch-oriented process
Poor support for some of the query types we need
Not clear how to get data back out of Druid
21 / 22
Thank you
http://collective.com
