Automatic Detection of Web Trackers by Vasia Kalavri

Automatic Detection
of Web Trackers
Vasia Kalavri
Apache Flink PMC, PhD student @KTH
kalavri@kth.se, @vkalavri

Telefonica Research, Barcelona
Computer Networks, Multimedia,
Online Social Networks, Security &
Privacy, Recommender Systems,
HCI & Mobile Computing,
Distributed Systems…

Browsing the Web
Ads
Recommendations

Tracking
[Diagram: two Trackers and an Ad Server; labels: cookie exchange, profiling, display relevant ads]

The study's authors defined "creepiness" by the feeling
consumers get when they sense an ad is too personal
because it uses data the consumer did not agree to
provide, such as online-search and browsing history.
Consumers are even more creeped out by this because
they don't know how and where that information will
be used.

Linking Tracker Information
[Diagram: a user browses amazon.com, imdb.com and facebook.com; trackers X and Y are embedded in these sites. Observed (IP, identifier) pairs:]
IP 1.1.1.1, ID-A = “aaa”
IP 1.1.1.1, ID-X = “xxx”
IP 2.2.2.2, ID-B = “bbb”
IP 2.2.2.2, ID-Y = “yyy”
IP 3.3.3.3, ID-C = “ccc”
IP 3.3.3.3, ID-X = “xxx”
IP 3.3.3.3, ID-Y = “yyy”
Steven Englehardt, Dillon Reisman, Christian Eubank, Peter Zimmerman, Jonathan Mayer, Arvind Narayanan, Edward W. Felten: Cookies That Give You Away: The Surveillance Implications of Web Tracking. WWW 2015: 289-299

Can’t we block them?
[Diagram: a proxy placed in front of two Trackers, an Ad Server and a Legitimate site]

Available Solutions
AdBlock, DoNotTrack, EasyPrivacy:
crowd-sourced “black lists” of tracker URLs
● not frequently updated
● unclear who blacklists URLs, or based on what criteria
● miss “hidden” trackers or dual-role nodes
● blocking requires manual matching against the list
● can you buy your way into the whitelist?
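To make the list-based approach concrete, here is a minimal sketch of matching a request host against a blacklist of tracker domains. The list entries and the helper name `is_blacklisted` are made up for illustration; real lists such as EasyPrivacy use a richer filter syntax that this sketch does not attempt to cover.

```python
def is_blacklisted(host, blacklist):
    """Return True if host equals a listed domain or is a subdomain of one."""
    host = host.lower().rstrip(".")
    parts = host.split(".")
    # Check the host itself and every parent domain: a.b.c -> a.b.c, b.c, c
    return any(".".join(parts[i:]) in blacklist for i in range(len(parts)))


# Hypothetical list entries, for illustration only
BLACKLIST = {"google-analytics.com", "scorecardresearch.com"}

print(is_blacklisted("www.google-analytics.com", BLACKLIST))  # True
print(is_blacklisted("b.scorecardresearch.com", BLACKLIST))   # True
print(is_blacklisted("github.com", BLACKLIST))                # False
```

A tracker that nobody has added to the list is never matched, which is the “hidden” tracker limitation named on the slide.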

Our Goal
Exploit fundamental properties necessary for tracker operation
Use existing data to build a tracker classifier
● structural attributes: connections, network positions
● operational aspects: data volume exchanged, communication patterns

Can we detect Trackers automatically?
● Are Trackers similar? How?
○ network structure
○ data received/sent
○ response times
○ latency
● Are Trackers different from normal sites? How?
● Are Trackers mainly connected to other Trackers?

The Road to our Goal
● algorithms
● tuning
● features
● combinations of algorithms and features and parameters...

The Dataset
#records: ~80m
#users: ~3k
#URLs: ~2m
#Trackers: ~4k
Sample record:
172.134.23.3 http://www.buzzfeed.com/sheridanwatson/happy-birthday-eva-you-lucky-gal#.gnJbE8EDDK
3 45 20150203:17080345 34 200 GET www.buzzfeed.com/ HTTP/1.1
Host: www.google-analytics.com
User-Agent: Mozilla/5.0 (Windows; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml, application/xml;q=0.9, */*;q=0.8
Keep-Alive: 300 Connection: keep-alive 234561 34 0 0 ES
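As a rough illustration of how such a record could be turned into graph input, the sketch below pulls out the user, the referer domain and the requested host. The field layout is an assumption, since the slide does not document the schema: the first field is taken to be the client IP, the second the referer URL, and the requested host is read from the `Host:` header. The function name `referer_host_pair` and the shortened sample record are hypothetical.

```python
import re
from urllib.parse import urlsplit

HOST_HEADER = re.compile(r"\bHost:\s*(\S+)")


def referer_host_pair(record):
    """Extract (user, referer_domain, requested_host) from one raw record.

    Assumed layout (not confirmed by the slides): field 0 is the client IP,
    field 1 is the referer URL, and the 'Host:' header names the requested host.
    Returns None when the record does not fit that layout.
    """
    fields = record.split()
    if len(fields) < 2:
        return None
    user, referer_url = fields[0], fields[1]
    match = HOST_HEADER.search(record)
    if match is None:
        return None
    referer = urlsplit(referer_url).hostname
    return user, referer, match.group(1).lower()


# Shortened, made-up variant of the sample record
record = (
    "172.134.23.3 http://www.buzzfeed.com/sheridanwatson/some-article "
    "3 45 20150203:17080345 34 200 GET www.buzzfeed.com/ HTTP/1.1 "
    "Host: www.google-analytics.com User-Agent: Mozilla/5.0"
)
print(referer_host_pair(record))
# ('172.134.23.3', 'www.buzzfeed.com', 'www.google-analytics.com')
```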

Basic Dataset Analysis
DataSet API
● How many requests to Trackers?
● Do Tracker requests have larger latency than other requests?
● How many Trackers
○ per user?
○ per request?
○ per website?
● Do popular websites embed more Trackers than others?
● Do same-topic websites share Trackers?
● Do different users visiting the same website end up on different Trackers?
● Do Trackers send / receive more / less bytes?
● Do they have more / less connections on average?
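A few of these questions reduce to simple aggregations. The sketch below answers three of them (share of requests going to Trackers, distinct Trackers per user, distinct Trackers per website) in plain Python. It is a stand-in for illustration, not Flink DataSet API code, and the function name `tracker_stats` and the toy data are invented.

```python
from collections import defaultdict


def tracker_stats(requests, trackers):
    """requests: iterable of (user, referer, host); trackers: set of hosts.

    Returns (share of requests to trackers,
             distinct trackers per user,
             distinct trackers per website).
    """
    total = 0
    to_trackers = 0
    per_user = defaultdict(set)
    per_site = defaultdict(set)
    for user, referer, host in requests:
        total += 1
        if host in trackers:
            to_trackers += 1
            per_user[user].add(host)
            per_site[referer].add(host)
    share = to_trackers / total if total else 0.0
    return (
        share,
        {u: len(t) for u, t in per_user.items()},
        {s: len(t) for s, t in per_site.items()},
    )


# Toy data, invented for illustration
requests = [
    ("u1", "news.example", "google-analytics.com"),
    ("u1", "news.example", "cdn.example"),
    ("u1", "shop.example", "b.scorecardresearch.com"),
    ("u2", "news.example", "google-analytics.com"),
]
trackers = {"google-analytics.com", "b.scorecardresearch.com"}

share, per_user, per_site = tracker_stats(requests, trackers)
print(share)     # 0.75
print(per_user)  # {'u1': 2, 'u2': 1}
print(per_site)  # {'news.example': 1, 'shop.example': 1}
```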

Main Idea
Model the data as a referer → host bipartite graph and exploit the graph structure to identify Trackers
URLs explicitly visited by the user: facebook.com, youtube.com
embedded URLs: google-analytics.com, b.scorecardresearch.com
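The referer → host graph can be sketched as a nested dictionary built from (referer, host) pairs. Using the request count as the edge weight is an assumption made here for illustration; the slides leave the weighting open. The function name `build_bipartite_graph` is hypothetical.

```python
from collections import defaultdict


def build_bipartite_graph(pairs):
    """Build a referer -> host graph; edge weight = number of requests seen."""
    graph = defaultdict(lambda: defaultdict(int))
    for referer, host in pairs:
        graph[referer][host] += 1
    # Convert to plain dicts so the result prints and compares cleanly
    return {referer: dict(hosts) for referer, hosts in graph.items()}


pairs = [
    ("facebook.com", "google-analytics.com"),
    ("youtube.com", "google-analytics.com"),
    ("youtube.com", "b.scorecardresearch.com"),
    ("youtube.com", "google-analytics.com"),
]
graph = build_bipartite_graph(pairs)
print(graph["youtube.com"])
# {'google-analytics.com': 2, 'b.scorecardresearch.com': 1}
```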

Attempt#1
Relevance Search
Iterative, random walk-like algorithm for bipartite graphs
Given an input source node, assign a “relevance score” to other nodes, based on how similar their network position is
Sun, J., Qu, H., Chakrabarti, D., & Faloutsos, C. (2005). Relevance search and anomaly detection in bipartite graphs. ACM SIGKDD Explorations Newsletter, 7(2), 48-55.

Relevance Search Algorithm
In each iteration, a vertex:
- sends a score to out-neighbors
- sums up received scores and updates value
[Figure: example with source google-analytics, nodes b.scorecardresearch and xzy/logo_small.jpg, and scores 0.9 and 0.1]
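The send-and-sum iteration above can be sketched as a random walk with restart, which fits the “random walk-like” description; whether it matches the exact variant used in the talk is an assumption. The restart probability, the iteration count, the function name `relevance_scores` and the toy edges are all illustrative choices.

```python
def relevance_scores(edges, source, restart=0.15, iterations=50):
    """Random walk with restart from `source` over an undirected graph.

    edges: iterable of (referer, host) pairs, treated as undirected.
    Returns a dict mapping each node to its relevance score (scores sum to 1).
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    if source not in neighbors:
        raise ValueError("source has no edges")

    scores = {node: 0.0 for node in neighbors}
    scores[source] = 1.0
    for _ in range(iterations):
        received = {node: 0.0 for node in neighbors}
        # Each vertex splits its current score evenly among its neighbors
        for node, score in scores.items():
            share = score / len(neighbors[node])
            for n in neighbors[node]:
                received[n] += share
        # Each vertex sums what it received; the source also gets the restart mass
        scores = {n: (1 - restart) * r for n, r in received.items()}
        scores[source] += restart
    return scores


# Toy graph, invented for illustration
edges = [
    ("site1", "google-analytics"), ("site1", "b.scorecardresearch"),
    ("site2", "google-analytics"), ("site2", "b.scorecardresearch"),
    ("site3", "b.scorecardresearch"), ("site3", "xzy/logo_small.jpg"),
]
scores = relevance_scores(edges, "google-analytics")
print(scores["b.scorecardresearch"] > scores["xzy/logo_small.jpg"])  # True
```

In this toy graph the host that shares two referers with the source ends up with a higher score than the image that shares none.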

Relevance Search Implementation
Gelly API
● single-source relevance search
○ similar to PageRank
○ easily mapped to vertex-centric iterations
● multi-source relevance search
○ each vertex keeps a vector of scores
○ compute top-k relevant nodes per source
○ merge the top-k lists
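The last two steps of the multi-source variant (top-k per source, then merge) could look like the sketch below. It works on already computed per-source score dictionaries and is plain Python, not Gelly code. Keeping each node's best score when merging is an assumption, since the slides do not say how the lists are combined; the names `top_k` and `merge_top_k` and the example hosts are hypothetical.

```python
import heapq


def top_k(scores, k, exclude=()):
    """Return the k highest-scoring nodes, skipping those in `exclude`."""
    candidates = ((s, n) for n, s in scores.items() if n not in exclude)
    return [n for s, n in heapq.nlargest(k, candidates)]


def merge_top_k(scores_by_source, k):
    """Union of the per-source top-k lists, ordered by each node's best score."""
    sources = set(scores_by_source)
    merged = {}
    for source, scores in scores_by_source.items():
        # Sources are known trackers already, so leave them out of the result
        for node in top_k(scores, k, exclude=sources):
            merged[node] = max(merged.get(node, 0.0), scores[node])
    return sorted(merged, key=lambda n: (-merged[n], n))


# Made-up relevance scores for two source trackers
scores_by_source = {
    "tracker-a.example": {
        "tracker-a.example": 0.5, "x.example": 0.3,
        "y.example": 0.1, "z.example": 0.05,
    },
    "tracker-b.example": {
        "tracker-b.example": 0.5, "y.example": 0.4,
        "w.example": 0.2, "tracker-a.example": 0.1,
    },
}
print(merge_top_k(scores_by_source, k=2))
# ['y.example', 'x.example', 'w.example']
```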

Data Pipeline
Bipartite graph creation → Multi-source Relevance Search → top-k relevant nodes → Classification
www.google-analytics.com: T
www.bscored-research.com: T
www.facebook.com: NT
www.github.com: NT
cdn.cxense.com: NT
...

Relevance Search Tuning
● How many and which sources to give as input?
● How to define convergence?
● Does initialization matter?
● How to weigh the input graph?
● How to define the relevance score threshold?

Relevance Search Problems
● Easy to find the few very similar and the few very different pages
● Popular trackers are similar to other popular trackers, but not to not-so-popular ones
● We might keep re-discovering what we already know

Relevance Search doesn’t seem to completely solve the problem…
Where do we go now?

Attempt#2...N-1
Combining Relevance Search with other algorithms
Several Clustering algorithms
k-NN Classification
Random Forest

Data Pipeline(s)
Bipartite graph creation → Multi-source Relevance Search → top-k relevant nodes → [feature extraction] → Classification [your clustering, classification, etc. algorithm here] → [evaluation]
www.google-analytics.com: T
www.bscored-research.com: T
www.facebook.com: NT
www.github.com: NT
cdn.cxense.com: NT
...
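As one example of what could be plugged into the classification slot, here is a minimal k-NN sketch in plain Python. The two features (a relevance score and a share of third-party requests) and all numbers are invented for illustration; the slides do not list the features that were actually used.

```python
import math
from collections import Counter


def knn_classify(labeled, query, k=3):
    """Classify `query` by majority vote among its k nearest labeled points.

    labeled: list of (feature_vector, label); query: feature vector.
    """
    nearest = sorted(labeled, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical features: (relevance score, share of third-party requests)
labeled = [
    ((0.90, 0.95), "T"), ((0.80, 0.90), "T"), ((0.70, 0.85), "T"),
    ((0.10, 0.20), "NT"), ((0.20, 0.10), "NT"), ((0.15, 0.30), "NT"),
]
print(knn_classify(labeled, (0.75, 0.80)))  # T
print(knn_classify(labeled, (0.10, 0.15)))  # NT
```

Features on very different scales would need normalizing before a distance-based classifier like this is meaningful.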

The Projection Graph
[Figure: the referer-hosts graph (referers r1 to r7, hosts h1 to h8) next to its hosts-projection graph, in which hosts are connected through the referers they share. Host labels shown in the figure: NT, NT, T, T, ?, T, NT, NT. Legend: referer, non-tracker host, tracker host, unlabeled host]
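Building the hosts-projection graph from the referer-hosts graph can be sketched as follows: two hosts get an edge when at least one referer embeds both, and the edge remembers which referers they share. The function name `project_hosts` and the small input graph are illustrative, not taken from the talk.

```python
from collections import defaultdict
from itertools import combinations


def project_hosts(referer_hosts):
    """referer_hosts: dict referer -> set of hosts embedded by that referer.

    Returns dict (host_a, host_b) -> set of shared referers, with host_a < host_b.
    """
    edges = defaultdict(set)
    for referer, hosts in referer_hosts.items():
        # Every pair of hosts seen under the same referer becomes an edge
        for pair in combinations(sorted(hosts), 2):
            edges[pair].add(referer)
    return dict(edges)


# Small made-up referer-hosts graph
referer_hosts = {
    "r1": {"h1", "h2"},
    "r2": {"h2", "h3"},
    "r3": {"h2", "h3", "h4"},
}
projection = project_hosts(referer_hosts)
print(sorted(projection[("h2", "h3")]))  # ['r2', 'r3']
```

Note that a referer embedding n hosts contributes n(n-1)/2 edges, so very popular pages can make the projection dense.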

Attempt#N
Community Detection on the Projection Graph
The Projection Graph captures implicit connections between trackers, through other sites
Do Trackers form communities in the Projection Graph?

Basic Analysis of the Projection Graph
DataSet & Gelly APIs
● Do they form connected components?
● Do Trackers have unusually high degrees?
● Are they mainly connected to other Trackers?
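Two of these questions (degree, and how many neighbors are Trackers) can be checked with a short pass over the projection graph's edges. This is a plain-Python sketch, not DataSet or Gelly code; the function name `neighbor_stats`, the edges and the labels are invented for illustration.

```python
def neighbor_stats(edges, labels):
    """edges: iterable of (host, host) pairs; labels: dict host -> 'T' or 'NT'.

    Returns dict host -> (degree, fraction of labeled neighbors that are trackers).
    The fraction is None when a host has no labeled neighbors.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    stats = {}
    for host, ns in neighbors.items():
        labeled = [labels[n] for n in ns if n in labels]
        fraction = labeled.count("T") / len(labeled) if labeled else None
        stats[host] = (len(ns), fraction)
    return stats


# Made-up projection-graph edges and labels
edges = [("h3", "h4"), ("h3", "h6"), ("h4", "h6"), ("h1", "h2"), ("h2", "h3")]
labels = {"h1": "NT", "h2": "NT", "h3": "T", "h4": "T", "h6": "T"}

stats = neighbor_stats(edges, labels)
print(stats["h4"])  # (2, 1.0)
print(stats["h1"])  # (1, 0.0)
```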

Final Data Pipeline
raw logs → 1: logs pre-processing → cleaned logs → 2: bipartite graph creation → 3: largest connected component extraction → 4: hosts-projection graph creation → 5: community detection → 6: results
APIs labelled on the slide: DataSet API, Gelly, DataSet API
google-analytics.com: T
bscored-research.com: T
facebook.com: NT
github.com: NT
cdn.cxense.com: NT
...
Very high accuracy and very low FPR :-)
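The slides do not name the community detection algorithm used in step 5. As a stand-in, the sketch below spreads known T / NT labels to unlabeled hosts by majority vote over their neighbors in the projection graph, a simple label-propagation-style heuristic. The function name `propagate_labels`, the tie handling and the toy graph are all assumptions made for illustration.

```python
from collections import Counter


def propagate_labels(edges, seeds, max_rounds=10):
    """Spread 'T'/'NT' labels from seed hosts to unlabeled neighbors.

    edges: iterable of (host, host) pairs from the projection graph.
    seeds: dict host -> label for hosts whose label is already known.
    An unlabeled host takes the strict majority label of its labeled
    neighbors; on a tie it stays unlabeled for that round.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    labels = dict(seeds)
    for _ in range(max_rounds):
        updates = {}
        for host in neighbors:
            if host in labels:
                continue
            votes = Counter(labels[n] for n in neighbors[host] if n in labels)
            ranked = votes.most_common(2)
            if ranked and (len(ranked) == 1 or ranked[0][1] > ranked[1][1]):
                updates[host] = ranked[0][0]
        if not updates:
            break
        # Apply all updates at once so the result does not depend on visit order
        labels.update(updates)
    return labels


# Toy projection graph: h5 is the only unlabeled host
edges = [
    ("h3", "h4"), ("h4", "h5"), ("h5", "h6"), ("h3", "h5"),
    ("h5", "h7"), ("h1", "h2"), ("h7", "h8"),
]
seeds = {"h1": "NT", "h2": "NT", "h3": "T", "h4": "T",
         "h6": "T", "h7": "NT", "h8": "NT"}

result = propagate_labels(edges, seeds)
print(result["h5"])  # T
```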

Lessons Learned
Start simple
Choose features incrementally
Visualize your data
Re-evaluate your models
Try different data representations
Use a flexible system

Optimizing the Pipeline
Flink Optimizer to the rescue :-)
