User Behaviour Tracking
Track - Store - Process
//Florian Pfeiffer - Head of Data & Infrastructure - gutefrage.net

Datageeks
Vision
"Let's build our own Google Analytics"
Why

Google Analytics does sampling
we want the (raw) data
Ideas, Thoughts & Goals
fast / minimal impact on page loading time
high availability
track user over multiple platforms
storage engine? -> HBase
Infrastructure
Numbers!
10-20ms Response Time per pixel
record so far: ~2500 concurrent requests
1.5 billion entries in HBase
10 Nodes in Hadoop Cluster
Serving Infrastructure

Loadbalancers & RR DNS
nginx with empty_gif module (~2ms)
data is written to logfile
Storing Infrastructure
every nginx node has flume-ng
flume ingests logfile
AsyncHBaseSink with custom Serializer
direct writes to HBase
why flume?

we had it already in production ;)
Storm might be an interesting alternative
HBase rowkey design
Why?

You can scan through all data and use filters for selecting specific data
But scanning with start & stop row speeds things up (a lot)
HBase rowkey design
Do I need a fast user or a fast timespan lookup?
User - clientid,ts<,connectionId>
Timespan - ts,clientid<,connectionId>
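The two key layouts above can be sketched in a few lines. This is only an illustration, not the code used at gutefrage.net: the function names, the comma separator handling, and the 13-digit zero-padded millisecond timestamp are assumptions made here so that byte order matches time order. It also assumes client ids contain no comma.

```python
def user_first_key(client_id: str, ts_ms: int, connection_id: str = "") -> bytes:
    """Key layout for fast per-user lookups: clientid,ts<,connectionId>."""
    parts = [client_id, f"{ts_ms:013d}"]  # zero-padded so byte order == numeric order
    if connection_id:
        parts.append(connection_id)
    return ",".join(parts).encode("utf-8")


def time_first_key(ts_ms: int, client_id: str, connection_id: str = "") -> bytes:
    """Key layout for fast timespan lookups: ts,clientid<,connectionId>."""
    parts = [f"{ts_ms:013d}", client_id]
    if connection_id:
        parts.append(connection_id)
    return ",".join(parts).encode("utf-8")


def user_scan_range(client_id: str) -> tuple:
    """Start row (inclusive) and stop row (exclusive) covering one user.

    '-' (0x2D) is the byte directly after ',' (0x2C), so the stop row
    sits just past every key that begins with 'client_id,'.
    """
    return f"{client_id},".encode("utf-8"), f"{client_id}-".encode("utf-8")


def timespan_scan_range(from_ms: int, to_ms: int) -> tuple:
    """Start row (inclusive) and stop row (exclusive) for [from_ms, to_ms)."""
    return f"{from_ms:013d}".encode("utf-8"), f"{to_ms:013d}".encode("utf-8")
```

With the user-first layout a scan bounded by user_scan_range("u1") touches only the rows of that user, while a timespan query would have to read the whole table; the time-first layout has the opposite trade-off.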
Inverse Timestamps
Data in HBase is stored lexicographically sorted
Normal TS - scan would yield oldest results first
Inverse TS - newer entries come first (and you can cancel the scan if you have enough data)
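A minimal sketch of the inverse-timestamp idea, under assumptions made here for illustration: the timestamp is stored as a fixed maximum minus the real value (a 13-digit decimal maximum in this sketch; implementations on HBase often subtract from the largest 64-bit long instead), and a sorted Python list stands in for HBase's sorted storage. All names are hypothetical.

```python
MAX_TS_MS = 10**13 - 1  # assumed upper bound: largest 13-digit millisecond value


def inverse_key(client_id: str, ts_ms: int) -> bytes:
    """User-first row key whose timestamp part shrinks as time advances."""
    return f"{client_id},{MAX_TS_MS - ts_ms:013d}".encode("utf-8")


def newest_first(keys, client_id: str, limit: int) -> list:
    """Return the `limit` most recent timestamps for one user.

    Iterating over sorted keys mimics an HBase scan. Because newer
    entries sort first, the loop can stop as soon as enough rows are read.
    """
    start = f"{client_id},".encode("utf-8")
    stop = f"{client_id}-".encode("utf-8")  # '-' is the byte after ','
    result = []
    for key in sorted(keys):
        if start <= key < stop:
            inverted = int(key.split(b",")[1])
            result.append(MAX_TS_MS - inverted)  # recover the real timestamp
            if len(result) == limit:
                break  # cancel the scan early
    return result
```

For example, after storing events for one user at timestamps 100, 300 and 200, newest_first(keys, user, 2) reads only the two most recent rows and stops.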
Cross Domain Tracking
(Flash)Cookies
Fingerprinting
Etag
HTML5 Storage
The olden times…
or
Cookies

Easy to drop a 3rd party cookie with userId on different websites
Increasingly blocked (Safari, Firefox, ...)
Fingerprinting
Yields interesting results on desktop, difficult on e.g. iPhone
invisible to user
Last resort if everything else fails?
Etag

Quite new, based on browser cache
sounds interesting
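How an ETag can carry an identifier through the browser cache can be sketched roughly as follows. This is not taken from the talk: the function name, the plain dict standing in for HTTP headers (with the exact key "If-None-Match"), and the use of a random UUID as the identifier are all assumptions for illustration.

```python
import uuid
from typing import Dict, Tuple


def handle_pixel_request(request_headers: Dict[str, str]) -> Tuple[int, Dict[str, str], str]:
    """Return (status, response_headers, visitor_id) for a tracking pixel request.

    First visit: no If-None-Match header, so a new id is issued as the ETag.
    Later visits: the browser revalidates its cached copy and sends the ETag
    back in If-None-Match, which lets the server recognise the visitor.
    """
    etag = request_headers.get("If-None-Match")
    if etag:
        visitor_id = etag.strip('"')
        # 304 tells the browser its cached copy (and therefore the ETag) is still valid.
        return 304, {"ETag": etag}, visitor_id

    visitor_id = uuid.uuid4().hex
    response_headers = {
        "ETag": f'"{visitor_id}"',
        # no-cache permits storing the response but requires revalidation on each use.
        "Cache-Control": "private, no-cache",
    }
    return 200, response_headers, visitor_id
```

The identifier only survives as long as the cached response does, so clearing the browser cache resets it.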
HTML5 Storage

Store data in local HTML5 storage
Retrieve data with Cross Domain Messaging
Store data

e.g. UserId, SessionId, GeoIP, URL, action, data
Batch Processing

Calculate how many users are active on platform A and also on B
Get Traffic of all Questions belonging to Channel X sorted by Country
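The two batch questions above boil down to a set intersection and a grouped count. The sketch below runs them over an in-memory list of events. The real jobs would read from HBase on the Hadoop cluster, and the field names (user_id, platform, channel, country) are hypothetical, chosen here only for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

Event = Dict[str, str]


def users_on_both(events: Iterable[Event], platform_a: str, platform_b: str) -> Set[str]:
    """Users with at least one event on platform A and at least one on platform B."""
    events = list(events)  # allow two passes even if a generator is passed in
    on_a = {e["user_id"] for e in events if e["platform"] == platform_a}
    on_b = {e["user_id"] for e in events if e["platform"] == platform_b}
    return on_a & on_b


def channel_traffic_by_country(events: Iterable[Event], channel: str) -> List[Tuple[str, int]]:
    """Event counts for one channel, grouped by country and sorted by country code."""
    counts = Counter(e["country"] for e in events if e["channel"] == channel)
    return sorted(counts.items())
```

len(users_on_both(events, "A", "B")) then answers the "how many users" question directly.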
Now to something completely different…
demo
Recommendations with Myrrix
Myrrix

Evolution: Taste -> Mahout -> Myrrix (-> Oryx)
Recommender based on ALS
Recommendations @ GF.net
Users emit signals on questions
view, like, gives answer, answer is voted best
Application sends signals through RabbitMQ to recommendation servers
YEAH
but what happens when a new user signs up?
?
Fetch data from tracking and feed it into Myrrix
Collecting & storing data works great
using & processing is another thing ;)