CLICKSTREAM ANALYSIS
WITH APACHE SPARK
Andreas Zitzelsberger
THE CHALLENGE
ONE POT TO RULE THEM ALL
Web Tracking Ad Tracking
ERP
CRM
▪ Products
▪ Inventory
▪ Margins
▪ Customer
▪ Orders
▪ Creditworthiness
▪ Ad Impressions
▪ Ad Costs
▪ Clicks & Views
▪ Conversions
ONE POT TO RULE THEM ALL
Retention Reach
Monetization
steer …
▪ Campaigns
▪ Offers
▪ Contents
REACT ON WEB SITE
TRAFFIC IN REAL TIME
Image: https://www.flickr.com/photos/nick-m/3663923048
SAMPLE RESULTS
Geolocated and gender-specific conversions.
Frequency of visits
Performance of an ad campaign
THE CONCEPTS
Image: Randy Paulino
THE FIRST SKETCH
(= real-time)
SQL
CALCULATING USER JOURNEYS
C V VT VT VT C X
C V
V V V V V V V
C V V C V V V
VT VT V V V VT C
V X
Event stream: User journeys:
Web / Ad tracking
KPIs:
▪ Unique users
▪ Conversions
▪ Ad costs / conversion value
▪ …
Legend: C = Click, V = View, VT = View Time, X = Conversion
THE ARCHITECTURE
Big Data
„LARRY & FRIENDS“ ARCHITECTURE
Does not run well with more than 1 TB of data in terms of ingestion speed, query time, and optimization effort
Image: adweek.com
Nope.
Sorry, no Big Data.
„HADOOP & FRIENDS“ ARCHITECTURE
Aggregation takes too long
Cumbersome programming model (can be solved with Pig, Cascading et al.)
Not interactive enough
Nope.
Too sluggish.
Κ-ARCHITECTURE
Cumbersome programming model
Over-engineered: We only need 15 min real-time ;-)
Stateful aggregations (unique x, conversions) require a separate DB with high throughput and fast aggregations & lookups.
Λ-ARCHITECTURE
Cumbersome programming model
Complex architecture
Redundant logic
FEELS OVER-ENGINEERED…
http://www.brainlazy.com/article/random-nonsense/over-engineered
The Final Architecture*
*) Maybe called µ-architecture one day ;-)
FUNCTIONAL ARCHITECTURE
Strange Events
Ingestion
Raw Event Stream
Collection
Events
Processing
Analytics Warehouse
Fact Entries
Atomic Event Frames
Data Lake
Master Data Integration
▪ Buffers load peaks
▪ Ensures message delivery (fire & forget for client)
▪ Create user journeys and unique user sets
▪ Enrich dimensions
▪ Aggregate events to KPIs
▪ Ability to replay for schema evolution
▪ The representation of truth
▪ Multidimensional data model
▪ Interactive queries for actions in real time and data exploration
▪ Eternal memory for all events (even strange ones)
▪ One schema per event type. Time-partitioned.
▪ Fault-tolerant message handling
▪ Event handling: apply schema, time partitioning, de-duplication, sanity checks, pre-aggregation, filtering, fraud detection
▪ Tolerates delayed events
▪ High throughput, moderate latency (~1 min)
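The event-handling steps listed for the collection stage can be sketched in plain Java. This is a minimal illustration, not the talk's implementation: `RawEvent`, `EventCollector`, the field names, and the hourly partition key are all assumptions, and the in-memory `seenIds` set would need a bounded or external store in a real deployment.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical raw event; the field names are assumptions, not the talk's schema. */
final class RawEvent {
    final String eventId;    // unique per event, used for de-duplication
    final long userId;
    final Instant eventTime;

    RawEvent(String eventId, long userId, Instant eventTime) {
        this.eventId = eventId;
        this.userId = userId;
        this.eventTime = eventTime;
    }
}

/** Hypothetical collection step: sanity check, de-duplication, time partitioning. */
final class EventCollector {
    // Hourly partitions keyed by event time, e.g. "2015-06-01T13" (an assumed granularity).
    private static final DateTimeFormatter HOUR =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH").withZone(ZoneOffset.UTC);

    private final Set<String> seenIds = new HashSet<>();
    private final Map<String, List<RawEvent>> partitions = new LinkedHashMap<>();
    private final List<RawEvent> strangeEvents = new ArrayList<>();

    /** Routes one event: strange events are set aside, duplicates are dropped. */
    void collect(RawEvent e) {
        if (e.eventId == null || e.eventTime == null || e.userId <= 0) {
            strangeEvents.add(e);   // failed the sanity check, but is still kept
            return;
        }
        if (!seenIds.add(e.eventId)) {
            return;                 // duplicate delivery of an event already seen
        }
        partitions.computeIfAbsent(HOUR.format(e.eventTime), k -> new ArrayList<>()).add(e);
    }

    Map<String, List<RawEvent>> partitions() { return partitions; }

    List<RawEvent> strangeEvents() { return strangeEvents; }
}
```

Partitioning by event time rather than arrival time means a delayed event still lands in the partition it belongs to, which is one way to read "tolerates delayed events".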
SERIAL CONNECTION OF STREAMING AND BATCHING
Ingestion
Raw Event Stream
Collection
Event Data Lake
Processing
Analytics Warehouse
Fact Entries
SQL Interface
Atomic Event Frames
▪ Cool programming model
▪ Uniform dev & ops
▪ Simple solution
▪ High compression ratio due to column-oriented storage
▪ High scan speed
▪ Cool programming model
▪ Uniform dev & ops
▪ High performance
▪ Interface to R out of the box
▪ Useful libs: MLlib, GraphX, NLP, …
▪ Good connectivity (JDBC, ODBC, …)
▪ Interactive queries
▪ Uniform ops
▪ Can easily be replaced due to Hive Metastore
▪ Obvious choice for cloud-scale messaging
▪ By far the best throughput and scalability of all evaluated alternatives
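import java.util.Map;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

public Map<Long, UserJourney> sessionize(JavaRDD<AtomicEvent> events) {
    return events
        // Convert to a pair RDD with the userId as key
        .mapToPair(e -> new Tuple2<>(e.getUserId(), e))
        // Build user journeys
        .<UserJourneyAcc>combineByKey(
            UserJourneyAcc::create,
            UserJourneyAcc::add,
            UserJourneyAcc::combine)
        // Unwrap each accumulator (toUserJourney is an assumed accessor, not shown on the slide)
        .mapValues(UserJourneyAcc::toUserJourney)
        // Convert to a Java map
        .collectAsMap();
}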
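The slide does not show `UserJourneyAcc`. As a rough sketch of what such an accumulator might look like, the class below provides the three functions that `combineByKey` expects (create a combiner, merge a value, merge two combiners). The fields of `AtomicEvent`, the list-based state, and the helper methods `orderedEvents` and `converted` are assumptions made for illustration; the code runs without Spark.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical event type; the talk does not show its fields. */
final class AtomicEvent implements Serializable {
    final long userId;
    final long timestamp;   // event time in epoch milliseconds (assumed)
    final String type;      // "C", "V", "VT" or "X" as in the journey legend

    AtomicEvent(long userId, long timestamp, String type) {
        this.userId = userId;
        this.timestamp = timestamp;
        this.type = type;
    }
}

/** Sketch of an accumulator that collects all events of one user. */
final class UserJourneyAcc implements Serializable {
    private final List<AtomicEvent> events = new ArrayList<>();

    /** createCombiner: starts a journey from the first event seen for a user. */
    static UserJourneyAcc create(AtomicEvent first) {
        UserJourneyAcc acc = new UserJourneyAcc();
        acc.events.add(first);
        return acc;
    }

    /** mergeValue: adds another event of the same user within one partition. */
    static UserJourneyAcc add(UserJourneyAcc acc, AtomicEvent e) {
        acc.events.add(e);
        return acc;
    }

    /** mergeCombiners: merges partial journeys built on different partitions. */
    static UserJourneyAcc combine(UserJourneyAcc a, UserJourneyAcc b) {
        a.events.addAll(b.events);
        return a;
    }

    /** Events sorted by event time, since arrival order is not guaranteed. */
    List<AtomicEvent> orderedEvents() {
        List<AtomicEvent> sorted = new ArrayList<>(events);
        sorted.sort(Comparator.comparingLong(e -> e.timestamp));
        return sorted;
    }

    /** True if the journey contains at least one conversion event. */
    boolean converted() {
        return events.stream().anyMatch(e -> "X".equals(e.type));
    }
}
```

Sorting only when the journey is read, rather than on every `add`, keeps the merge functions cheap; whether that trade-off fits depends on journey length and how often journeys are read.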
STREAM VERSUS BATCH
https://en.wikipedia.org/wiki/Tanker_(ship)#/media/File:Sirius_Star_2008b.jpg
https://blog.allstate.com/top-5-safety-tips-at-the-gas-pump/
APACHE FLINK
■ Also has a nice, Spark-like API
■ Promises similar or better performance than Spark
■ Looks like the best solution for a κ-architecture
■ But it’s also the newest kid on the block
EVENT VERSUS PROCESSING TIME
■ There’s a difference between event time (te) and processing time (tp).
■ Events arrive out of order even during normal operation.
■ Events may arrive arbitrarily late.
Apply a grace period before processing events.
Allow arbitrary update windows of metrics.
EXAMPLE
[Figure: facts inserted (I) and later updated (U) at each time resolution (Minute, Hour, Day, Week, Month, Quarter, Year), plotted as resolution in time against time.]
tp: Processing time
ti: Ingestion time
te: Event time
dtp: Aggregation time frame
dtw: Grace period
I: Insert fact
U: Update fact
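One way to read the grace-period idea is sketched below. Events are counted per aggregation time frame (dtp) based on their event time. A frame's fact is inserted once processing time has passed the end of the frame plus the grace period (dtw). Events that arrive after that update the existing fact. The class name, the simple event count used as the metric, and the in-memory maps are assumptions for illustration only.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Hypothetical aggregator: insert a fact after the grace period, update it on late events. */
final class GracePeriodAggregator {
    private final long frameMillis;   // dtp: aggregation time frame
    private final long graceMillis;   // dtw: grace period
    private final Map<Long, Long> pending = new HashMap<>();  // frame start -> buffered count
    private final Map<Long, Long> facts = new HashMap<>();    // frame start -> inserted count

    GracePeriodAggregator(Duration frame, Duration grace) {
        this.frameMillis = frame.toMillis();
        this.graceMillis = grace.toMillis();
    }

    /** Returns true if the event updated an already inserted fact (late arrival). */
    boolean onEvent(Instant eventTime) {
        long frameStart = Math.floorDiv(eventTime.toEpochMilli(), frameMillis) * frameMillis;
        if (facts.containsKey(frameStart)) {
            facts.merge(frameStart, 1L, Long::sum);   // update fact
            return true;
        }
        pending.merge(frameStart, 1L, Long::sum);     // buffer until the grace period is over
        return false;
    }

    /** Inserts facts for all frames whose grace period has passed at processing time tp. */
    int flush(Instant processingTime) {
        long tp = processingTime.toEpochMilli();
        int inserted = 0;
        Iterator<Map.Entry<Long, Long>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, Long> entry = it.next();
            if (entry.getKey() + frameMillis + graceMillis <= tp) {
                facts.put(entry.getKey(), entry.getValue());  // insert fact
                it.remove();
                inserted++;
            }
        }
        return inserted;
    }

    /** Current value of the fact for the frame starting at frameStart, or 0 if none exists. */
    long fact(Instant frameStart) {
        return facts.getOrDefault(frameStart.toEpochMilli(), 0L);
    }
}
```

With a 15-minute frame and a 1-minute grace period (illustrative values), a frame starting at 12:00 is inserted by the first `flush` at or after 12:16. An event with event time 12:05 that arrives later returns `true` from `onEvent` and increments the stored fact.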
LESSONS LEARNED
Image: http://hochmeister-alpin.at
BEST-OF-BREED INSTEAD OF COMMODITY SOLUTIONS
ETL
Analytics
Realtime
Analytics
Slice &
Dice
Data
Exploration
Polyglot Processing
http://datadventures.ghost.io/2014/07/06/polyglot-processing
POLYGLOT ANALYTICS
Data Lake
Analytics
Warehouse
SQL lane
R lane
Timeseries lane
Reporting Data Exploration
Data Science
NO RETENTION PARANOIA
Data Lake
Analytics
Warehouse
▪ Eternal memory
▪ Close to raw events
▪ Allows replays and refills into warehouse
Aggressive forgetting with clearly defined retention policy per aggregation level like:
▪ 15min:30d
▪ 1h:4m
▪ …
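A retention policy per aggregation level could be expressed as a small lookup table. The sketch below assumes that `15min:30d` means "15-minute aggregates are kept for 30 days" and that `1h:4m` means "hourly aggregates are kept for 4 months". That reading, the class name `RetentionPolicy`, and its methods are assumptions, not something the slides spell out.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.Period;
import java.time.ZoneOffset;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical retention policy: aggregation level -> how long its facts are kept. */
final class RetentionPolicy {
    private final Map<Duration, Period> retentionByLevel = new LinkedHashMap<>();

    /** Registers the retention period for one aggregation level. */
    RetentionPolicy keep(Duration aggregationLevel, Period retention) {
        retentionByLevel.put(aggregationLevel, retention);
        return this;
    }

    /** True if a fact of the given level, starting at frameStart, may be dropped at 'now'. */
    boolean isExpired(Duration aggregationLevel, Instant frameStart, Instant now) {
        Period retention = retentionByLevel.get(aggregationLevel);
        if (retention == null) {
            return false;   // no policy defined for this level: keep the fact
        }
        // Calendar arithmetic (days, months) is done in UTC, which is an assumption.
        Instant deadline = frameStart.atZone(ZoneOffset.UTC).plus(retention).toInstant();
        return now.isAfter(deadline);
    }
}
```

A policy matching the two example entries would be built as `new RetentionPolicy().keep(Duration.ofMinutes(15), Period.ofDays(30)).keep(Duration.ofHours(1), Period.ofMonths(4))`. Because the data lake keeps the events close to raw, facts dropped this way could still be refilled by a replay.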
Events
Strange Events
BEWARE OF THE HIPSTERS
Image: h&m
ENSURE YOUR SOFTWARE RUNS LOCALLY
The entire architecture must be able to run locally. Keep the round trips low for development and testing.
TUNE CONTINUOUSLY
Throughput and reaction times need to be monitored continuously. Tune your software and the underlying frameworks as needed.
Ingestion
Raw Event Stream
Collection
Event Data Lake
Processing
Analytics Warehouse
Fact Entries
SQL Interface
Atomic Event Frames
Load generator
Throughput & latency probes
System, container and process monitoring
IN NUMBERS
Overall dev effort until the first release: 250 person days
Dimensions: 10
KPIs: 26
Integrated 3rd party systems: 7
Inbound data volume per day: 80GB
New data in DWH per day: 2GB
Total price of cheapest cluster which is able to handle production load:
THANK YOU
@andreasz82
andreas.zitzelsberger@qaware.de
BONUS SLIDES
CALCULATING UNIQUE USERS
■ We need an exact unique user count.
■ If you can, you should use an approximation such as HyperLogLog.
[Figure: users U1, U2, U3, U1, U4 plotted over time; one window has 3 unique users (UU), the next has 2 UU, and both together have 4 UU.]
Flajolet, P.; Fusy, E.; Gandouet, O.; Meunier, F. (2007). "HyperLogLog: the analysis of a near-optimal
cardinality estimation algorithm". AOFA ’07: Proceedings of the 2007 International Conference on the
Analysis of Algorithms.
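An exact count requires keeping the user sets themselves, because unique-user counts of separate time windows cannot simply be added. A minimal sketch, using a hypothetical `UniqueUsers` helper and plain `Long` user IDs:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for exact unique-user counting with sets. */
final class UniqueUsers {
    /** Exact unique users of one time window. */
    static Set<Long> of(List<Long> userIds) {
        return new HashSet<>(userIds);
    }

    /** Exact unique users across two windows: union the sets, never add the counts. */
    static Set<Long> union(Set<Long> a, Set<Long> b) {
        Set<Long> all = new HashSet<>(a);
        all.addAll(b);
        return all;
    }
}
```

For one window with users 1, 2, 3 and a second with users 1, 4, the window sets have sizes 3 and 2, while their union has size 4, not 5. The memory cost of holding these sets is the reason an approximation is attractive whenever exactness is not required.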
CHARTING TECHNOLOGY
https://github.com/qaware/big-data-landscape
CHOOSING WHERE TO AGGREGATE
Ingestion Event Data Lake Processing Analytics
Warehouse
Fact
Entries
Analytics
Atomic Event
Frames
1
- Enrichment
- Preprocessing
- Validation
2
- The heavy lifting.
3
- Processing steps that can be done at query time.
- Interactive queries.
