AUDIENCE COUNTING @ SCALE
Boris Trofimoff
Sigma Software & Collective
@b0ris_1
AGENDA
1. About our customer
2. Motivation or Why & What we are counting
3. Counting Fundamentals
4. Counting with Spark
5. Spark Road Notes
Place where advertisers and end-users meet each other
Collective Media is a full stack cookie serving company
USERS
~1B active user profiles
MODELS
1000s of models built weekly
PREDICTIONS
100s of billions of predictions daily
MODELING AT SCALE
VOLUME
Petabytes of data used
VARIETY
Profiles, formats, screens
VELOCITY
100k+ requests per second
20 billion events per day
VERACITY
Robust measurements
MOTIVATION
HOW AUDIENCE IS CREATED
IMPRESSION LOG
AD             SITE          COOKIE           IMPRESSIONS  CLICKS  SEGMENTS
bmw_X5         forbes.com    13e835610ff0d95  10           1       [a.m, b.rk, c.rh, d.sn, ...]
mercedes_2015  forbes.com    13e8360c8e1233d  5            0       [a.f, b.rk, c.hs, d.mr, ...]
nokia          gizmodo.com   13e3c97d526839c  8            0       [a.m, b.tk, c.hs, d.sn, ...]
apple_music    reddit.com    1357a253f00c0ac  3            1       [a.m, b.rk, d.sn, e.gh, ...]
nokia          cnn.com       13b23555294aced  2            1       [a.f, b.tk, c.rh, d.sn, ...]
apple_music    facebook.com  13e8333d16d723d  9            1       [a.m, d.sn, g.gh, s.hr, ...]
SEGMENT EXAMPLES
SEGMENT DESCRIPTION
a.m Male
a.f Female
b.tk $75k-$100k annual income
b.rk $100k-$150k annual income
c.hs High School
c.rh College
d.sn Single
d.mr Married
WHAT WE CAN DO WITH DATA
What is the male/female ratio for people who have seen the bmw_X5 ad on forbes.com?
What is the income distribution for people who have seen the Apple Music ad?
What is the Nokia click distribution across different education levels?
BUILDING AUDIENCE PROFILE
COUNTING
FUNDAMENTALS
SQL?
SELECT count(distinct cookie_id)
FROM impressions
WHERE site = 'forbes.com' AND ad = 'bmw_X5' AND segment contains 'a.m'
Infinite combinations
Big Data => Big Latency for Hive,
Impala and Druid
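To make the cost of the exact approach concrete, here is a minimal plain-Scala sketch of what the count(distinct) query has to do. The `Impression` case class and the `ExactCount` object are hypothetical names for illustration, and the sample rows are copied from the impression log above. The point is that the exact answer needs a set holding every matching cookie, so memory grows with the audience size and the work must be repeated for every filter combination.

```scala
object ExactCount {
  // Hypothetical row type mirroring the impression log columns.
  final case class Impression(
      ad: String,
      site: String,
      cookie: String,
      impressions: Long,
      clicks: Long,
      segments: Set[String]
  )

  // Exact distinct count: every matching cookie is kept in a Set,
  // so the state is proportional to the number of distinct cookies.
  def distinctCookies(log: Seq[Impression], ad: String, site: String, segment: String): Int =
    log.iterator
      .filter(r => r.ad == ad && r.site == site && r.segments.contains(segment))
      .map(_.cookie)
      .toSet
      .size

  // Three rows taken from the impression log slide.
  val sample: Seq[Impression] = Seq(
    Impression("bmw_X5", "forbes.com", "13e835610ff0d95", 10, 1, Set("a.m", "b.rk", "c.rh", "d.sn")),
    Impression("mercedes_2015", "forbes.com", "13e8360c8e1233d", 5, 0, Set("a.f", "b.rk", "c.hs", "d.mr")),
    Impression("nokia", "gizmodo.com", "13e3c97d526839c", 8, 0, Set("a.m", "b.tk", "c.hs", "d.sn"))
  )
}
```

Calling `ExactCount.distinctCookies(ExactCount.sample, "bmw_X5", "forbes.com", "a.m")` scans the whole log for one question; each new combination of ad, site and segment means another full scan.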
CARDINALITY ESTIMATION ALGORITHMS
ACCURACY
For a fixed amount of memory, the algorithm should provide as accurate an estimate as possible. Especially for small cardinalities, the results should be near exact.
MEMORY EFFICIENCY
The algorithm should use the available memory efficiently and adapt its memory usage to the cardinality.
ESTIMATE LARGE CARDINALITIES
Multisets with cardinalities well beyond 1 billion occur on a daily basis, and it is important that such large cardinalities can be estimated with reasonable accuracy.
PRACTICALITY
The algorithm should be implementable and maintainable.
HYPERLOGLOG AND OTHERS
AUDIENCE CARDINALITY
APPROXIMATION WITH HYPERLOGLOG
Create Audience of people addressed by
unique identifiers (cookies)
Create Audience “Hash Sum” file with fixed
size regardless of audience size
Cardinalities ~ 10^9
with a typical accuracy of 2%
using 1.5KB of memory.
[Figure: Create Audience → Create Hash (1.5KB)]
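The accuracy and memory figures can be sanity-checked with a little arithmetic. The standard error of HyperLogLog with m registers is roughly 1.04 / sqrt(m). The sketch below assumes 2048 registers of 6 bits each; that layout is an assumption for illustration, since the slide does not state the register count or width. `HllSizing` is a hypothetical helper name.

```scala
object HllSizing {
  // Approximate standard error of HyperLogLog with the given number of registers.
  def standardError(registers: Int): Double =
    1.04 / math.sqrt(registers.toDouble)

  // Bytes needed to store the registers when they are bit-packed.
  def memoryBytes(registers: Int, bitsPerRegister: Int): Int =
    (registers * bitsPerRegister + 7) / 8
}
```

Under these assumptions, `HllSizing.memoryBytes(2048, 6)` is 1536 bytes (1.5KB) and `HllSizing.standardError(2048)` is about 0.023, in line with the "typical accuracy of 2%" quoted on the slide. The size depends only on the register count, not on how many cookies were added.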
HYPERLOGLOG OPERATIONS
trait HyperLogLog {
  def add(cookieId: String): Unit

  // |A|
  def cardinality(): Long

  // |A ∪ B|
  def merge(other: HyperLogLog): HyperLogLog

  // |A ∩ B| = |A| + |B| - |A ∪ B|
  def intersect(other: HyperLogLog): Long
}
[Figure: merge (∪) of two 1.5KB HyperLogLogs gives a 1.5KB HyperLogLog; intersect (∩) of two 1.5KB HyperLogLogs gives a count]
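The four operations above can be illustrated with a small, self-contained HyperLogLog in plain Scala. This is a teaching sketch, not the implementation used in the talk: the class name `SimpleHll`, the choice of 2048 one-byte registers, the seeds, and the 64-bit hash built from two `MurmurHash3` calls are all assumptions made here. It uses the raw HyperLogLog estimator with linear counting for small sets and omits the bias corrections of production libraries.

```scala
import scala.util.hashing.MurmurHash3

object SimpleHll {
  val P = 11        // number of index bits (assumed for this sketch)
  val M = 1 << P    // 2048 registers
  val Alpha = 0.7213 / (1.0 + 1.079 / M)

  // 64-bit hash assembled from two 32-bit MurmurHash3 values with different seeds.
  def hash64(s: String): Long = {
    val hi = MurmurHash3.stringHash(s, 0x1234567)
    val lo = MurmurHash3.stringHash(s, 0x7654321)
    (hi.toLong << 32) | (lo.toLong & 0xffffffffL)
  }
}

final class SimpleHll(private val registers: Array[Byte] = new Array[Byte](SimpleHll.M)) {
  import SimpleHll._

  def add(cookieId: String): Unit = {
    val h = hash64(cookieId)
    val idx = (h >>> (64 - P)).toInt   // top P bits select a register
    val rest = h << P                  // remaining bits, left-aligned
    // Rank = position of the first 1-bit in the remaining bits, capped at 64 - P + 1.
    val rank = math.min(java.lang.Long.numberOfLeadingZeros(rest) + 1, 64 - P + 1)
    if (rank > registers(idx)) registers(idx) = rank.toByte
  }

  // |A|
  def cardinality(): Long = {
    val sum = registers.foldLeft(0.0)((acc, r) => acc + math.pow(2.0, -r.toDouble))
    val raw = Alpha * M * M / sum
    val zeros = registers.count(_ == 0)
    val estimate =
      if (raw <= 2.5 * M && zeros > 0) M * math.log(M.toDouble / zeros) // linear counting
      else raw
    math.round(estimate)
  }

  // |A ∪ B|: register-wise maximum, the result has the same fixed size.
  def merge(other: SimpleHll): SimpleHll = {
    val out = new Array[Byte](M)
    var i = 0
    while (i < M) {
      out(i) = math.max(registers(i), other.registers(i)).toByte
      i += 1
    }
    new SimpleHll(out)
  }

  // |A ∩ B| = |A| + |B| - |A ∪ B| (inclusion-exclusion, clamped at zero).
  def intersect(other: SimpleHll): Long =
    math.max(0L, cardinality() + other.cardinality() - merge(other).cardinality())
}
```

Because the intersection is derived from three estimates, their errors add up. When the overlap is small compared with the two audiences, the relative error of the intersection can be much larger than the roughly 2% error of a single sketch.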
COUNTING
WITH SPARK
FROM COOKIES TO HYPERLOGLOG
AD      SITE        COOKIES HLL         IMPRESSIONS  CLICKS
bmw_X5  forbes.com  HyperLogLog@23sdg4  5468         35
bmw_X5  cnn.com     HyperLogLog@84jdg4  8943         29

SEGMENT      COOKIES HLL         IMPRESSIONS  CLICKS
Male         HyperLogLog@65xzx2  235468       335
$100k-$150k  HyperLogLog@12das1  569473       194
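The pre-aggregation behind these tables can be sketched in plain Scala as a group-by that folds cookies into one value per key and sums the counters. To keep the example self-contained, an exact `Set[String]` stands in for the HyperLogLog column; that swap is an assumption for illustration, and a real pipeline would add each cookie to a sketch instead. `PreAggregate`, `LogRow` and `AdSiteRow` are hypothetical names.

```scala
object PreAggregate {
  // One raw impression log row (segments omitted for brevity).
  final case class LogRow(ad: String, site: String, cookie: String, impressions: Long, clicks: Long)

  // One pre-aggregated row; Set[String] is an exact stand-in for the HyperLogLog column.
  final case class AdSiteRow(ad: String, site: String, cookies: Set[String], impressions: Long, clicks: Long)

  // Collapse the log to one row per (ad, site): union the cookies, sum the counters.
  def byAdAndSite(log: Seq[LogRow]): Seq[AdSiteRow] =
    log.groupBy(r => (r.ad, r.site)).toSeq.map { case ((ad, site), rows) =>
      AdSiteRow(
        ad,
        site,
        rows.map(_.cookie).toSet,
        rows.map(_.impressions).sum,
        rows.map(_.clicks).sum
      )
    }
}
```

The segment table follows the same pattern with the segment as the grouping key, after expanding each log row into one row per segment.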
DATA FRAMES
val adImpressions: DataFrame = sqlContext.parquetFile("/aa/${yy-mm-dd}/${hh}/audience")
adImpressions.printSchema()
// root
//  |-- ad: string (nullable = true)
//  |-- site: string (nullable = true)
//  |-- hll: binary (nullable = true)
//  |-- impressions: long (nullable = true)
//  |-- clicks: long (nullable = true)

val segmentImpressions: DataFrame = sqlContext.parquetFile("/aa/${yy-mm-dd}/${hh}/segments")
segmentImpressions.printSchema()
// root
//  |-- segment: string (nullable = true)
//  |-- hll: binary (nullable = true)
//  |-- impressions: long (nullable = true)
//  |-- clicks: long (nullable = true)
LET’S COUNT SOMETHING
import org.apache.spark.sql.functions._
import org.apache.spark.sql.HLLFunctions._

val bmwCookies: HyperLogLog = adImpressions
  .filter(col("ad") === "bmw_X5")
  .select(mergeHLL(col("hll"))).first() // -- sum(clicks)

val educatedCookies: HyperLogLog = segmentImpressions
  .filter(col("segment").in(lit("College"), lit("High School")))
  .select(mergeHLL(col("hll"))).first()

// intersect already returns the estimated |A ∩ B|; convert to Double to avoid integer division
val p = bmwCookies.intersect(educatedCookies).toDouble / bmwCookies.cardinality()
What percentage of the BMW campaign audience has a college or high school education?
SPARK
ROAD NOTES
WRITING OWN SPARK
AGGREGATION FUNCTIONS
// Define the function that will be applied to each row in an RDD partition
case class MergeHLLPartition(child: Expression)
  extends AggregateExpression with trees.UnaryNode[Expression] { ... }

// Define the function that will take results from different partitions and merge them together
case class MergeHLLMerge(child: Expression)
  extends AggregateExpression with trees.UnaryNode[Expression] { ... }

// Tell Spark how you want it to split your computation across the RDD
case class MergeHLL(child: Expression)
  extends PartialAggregate with trees.UnaryNode[Expression] {

  override def asPartial: SplitEvaluation = {
    val partial = Alias(MergeHLLPartition(child), "PartialMergeHLL")()
    SplitEvaluation(
      MergeHLLMerge(partial.toAttribute),
      partial :: Nil)
  }
}

def mergeHLL(e: Column): Column = MergeHLL(e.expr)
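The split into a per-partition step and a cross-partition merge step can be shown without Spark. The following plain-Scala analogue is a sketch only: it does not use Spark's expression API, `PartialMerge` and its methods are hypothetical names, and a `Vector[Int]` of registers stands in for a serialized HyperLogLog. It works because the register-wise maximum is associative and commutative, so partial results can be combined in any order.

```scala
object PartialMerge {
  type Registers = Vector[Int]

  // Register-wise maximum: the merge operation of HyperLogLog.
  def mergeRegisters(a: Registers, b: Registers): Registers =
    a.zip(b).map { case (x, y) => math.max(x, y) }

  // Step 1 (per partition): fold every row of one partition into a single partial result.
  def mergePartition(rows: Seq[Registers], width: Int): Registers =
    rows.foldLeft(Vector.fill(width)(0))(mergeRegisters)

  // Step 2 (across partitions): combine the partial results into the final value.
  def mergePartials(partials: Seq[Registers], width: Int): Registers =
    partials.foldLeft(Vector.fill(width)(0))(mergeRegisters)

  // Split evaluation: run step 1 on each partition, then step 2 on the outputs.
  def mergeAll(partitions: Seq[Seq[Registers]], width: Int): Registers =
    mergePartials(partitions.map(mergePartition(_, width)), width)
}
```

Only one fixed-size partial result per partition crosses the network, which is why this pattern scales better than shipping every row to a single reducer.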
AGGREGATION FUNCTIONS
PROS & CONS
Pros:
Simple DSL; the functions look like native DataFrame functions
Much faster than solving the same problem with Scala transformations on top of RDD[Row]
Dramatic performance speed-up via mutable state control (about 10x)
Cons:
The UDF has to live in a private Spark package, so there is a risk that these interfaces will be changed or abandoned in the future.
SPARK AS IN-MEMORY SQL DATABASE
CHANGE: BATCH-DRIVEN APP → LONG-RUNNING APP
BATCH-DRIVEN APP
Create SparkContext → Run Calculations → Show Result → Destroy SparkContext
LONG-RUNNING APP
Create SparkContext → Load Data → Cache it in memory → Receive Request → Run Calculations → Show Result → Destroy SparkContext
~ 500 GB (1 year history)
~40N occupied from ~200N cluster
Response time 1-2 seconds
REFERENCES
http://eugenezhulenev.com/blog/2015/07/15/interactive-audience-analytics-with-spark-and-hyperloglog/
(Special thanks to Eugene Zhulenev for his brilliant blog post)
https://github.com/collectivemedia/spark-hyperloglog
http://research.google.com/pubs/pub40671.html
https://github.com/AdRoll/cantor
http://tech.adroll.com/blog/data/2013/07/10/hll-minhash.html
THANK YOU!

Spark-driven audience counting by Boris Trofimov
