Automatic Detection
of Web Trackers
Vasia Kalavri
Apache Flink PMC, PhD student @KTH
kalavri@kth.se, @vkalavri
Telefonica Research, Barcelona
Computer Networks, Multimedia, Online Social Networks, Security & Privacy, Recommender Systems, HCI & Mobile Computing, Distributed Systems…
Browsing the Web
● Ads
● Recommendations
Tracking
[Diagram: Trackers and an Ad Server; labels: cookie exchange, profiling, display relevant ads]
The study's authors defined "creepiness" by the feeling
consumers get when they sense an ad is too personal
because it uses data the consumer did not agree to
provide, such as online-search and browsing history.
Consumers are even more creeped out by this because
they don't know how and where that information will
be used.
Linking Tracker Information
● amazon.com (embeds tracker X): IP 1.1.1.1, ID-A = “aaa”; IP 1.1.1.1, ID-X = “xxx”
● imdb.com (embeds tracker Y): IP 2.2.2.2, ID-B = “bbb”; IP 2.2.2.2, ID-Y = “yyy”
● facebook.com (embeds trackers X and Y): IP 3.3.3.3, ID-C = “ccc”; IP 3.3.3.3, ID-X = “xxx”; IP 3.3.3.3, ID-Y = “yyy”
Steven Englehardt, Dillon Reisman, Christian Eubank, Peter Zimmerman, Jonathan Mayer, Arvind Narayanan, Edward W. Felten: Cookies That Give
You Away: The Surveillance Implications of Web Tracking. WWW 2015: 289-299
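The linking idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the method of the cited paper: the (IP, cookie name, cookie value) triples are copied from the slide, while the `link_by_shared_ids` helper and the union-find grouping are assumptions made for this example.

```python
from collections import defaultdict

# (ip, cookie_name, cookie_value) observations, copied from the slide's example.
observations = [
    ("1.1.1.1", "ID-A", "aaa"),
    ("1.1.1.1", "ID-X", "xxx"),
    ("2.2.2.2", "ID-B", "bbb"),
    ("2.2.2.2", "ID-Y", "yyy"),
    ("3.3.3.3", "ID-C", "ccc"),
    ("3.3.3.3", "ID-X", "xxx"),
    ("3.3.3.3", "ID-Y", "yyy"),
]

def link_by_shared_ids(obs):
    """Group IPs that are connected through a shared cookie value (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Every observation ties an IP to the cookie seen from that IP.
    for ip, name, value in obs:
        union(("ip", ip), ("cookie", name, value))

    groups = defaultdict(set)
    for node in list(parent):
        if node[0] == "ip":
            groups[find(node)].add(node[1])
    return sorted(sorted(group) for group in groups.values())

linked = link_by_shared_ids(observations)
```

In this toy input, ID-X ties 1.1.1.1 to 3.3.3.3 and ID-Y ties 2.2.2.2 to 3.3.3.3, so all three IPs end up in one group even though no single site saw them together.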
Can’t we block them?
[Diagram: proxy, Tracker, Tracker, Ad Server, Legitimate site]
Available Solutions
AdBlock, DoNotTrack, EasyPrivacy: crowd-sourced “black lists” of tracker URLs
● not frequently updated
● unclear who blacklists URLs, or based on what criteria
● miss “hidden” trackers and dual-role nodes
● blocking requires manual matching against the list
● can you buy your way into the whitelist?
Our Goal
Use existing data to build a tracker classifier
Exploit fundamental properties necessary for tracker operation:
● structural attributes: connections, network positions
● operational aspects: data volume exchange, communication patterns
Can we detect Trackers automatically?
● Are Trackers similar? How?
○ network structure
○ data received/sent
○ response times
○ latency
● Are Trackers different from normal sites? How?
● Are Trackers mainly connected to other Trackers?
The Road to our Goal
● algorithms
● tuning
● features
● combinations of
algorithms and features
and parameters...
The Dataset
Sample record:
172.134.23.3 http://www.buzzfeed.com/sheridanwatson/happy-birthday-eva-you-lucky-gal#.gnJbE8EDDK 3 45 20150203:17080345 34 200 GET www.buzzfeed.com/ HTTP/1.1 Host: www.google-analytics.com User-Agent: Mozilla/5.0 (Windows; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729) Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 Keep-Alive: 300 Connection: keep-alive 234561 34 0 0 ES
#records: ~80m
#users: ~3k
#URLs: ~2m
#Trackers: ~4k
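As a rough sketch of how a (referer, host) pair might be pulled out of such a record, the Python below splits one log line. The field layout is an assumption: the slides do not document it, so the code simply treats the second whitespace-separated field as the referer URL and reads the requested host from the `Host:` header. The `referer_and_host` helper and the shortened sample record are hypothetical.

```python
import re
from urllib.parse import urlsplit

def referer_and_host(record):
    """Extract (referer hostname, requested host) from one raw log line."""
    fields = record.split()
    # Assumption: the second field is the referer URL.
    referer = urlsplit(fields[1]).hostname
    # The requested host is taken from the HTTP Host header, if present.
    match = re.search(r"Host:\s*(\S+)", record)
    host = match.group(1) if match else None
    return referer, host

# Shortened, made-up record in the same shape as the slide's sample.
record = (
    "172.134.23.3 http://www.buzzfeed.com/some-article 3 45 "
    "20150203:17080345 34 200 GET www.buzzfeed.com/ HTTP/1.1 "
    "Host: www.google-analytics.com User-Agent: Mozilla/5.0"
)
pair = referer_and_host(record)
```

Here `pair` holds the page the user was on and the third-party host the browser contacted, which is the shape of input the later graph steps work with.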
Basic Dataset Analysis
DataSet API
● How many requests to Trackers?
● Do Tracker requests have larger latency than other requests?
● How many Trackers
○ per user?
○ per request?
○ per website?
● Do popular websites embed more Trackers than others?
● Do same-topic websites share Trackers?
● Do different users visiting the same website end up on different Trackers?
● Do Trackers send / receive more / less bytes?
● Do they have more / less connections on average?
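Two of these questions can be illustrated with a small plain-Python sketch. It is not the Flink DataSet API job from the talk; the request pairs, the tracker list, and both helper names are made up for illustration.

```python
from collections import defaultdict

# Toy (referer, host) pairs and a toy tracker list; both are invented for this example.
requests = [
    ("buzzfeed.com", "google-analytics.com"),
    ("buzzfeed.com", "b.scorecardresearch.com"),
    ("buzzfeed.com", "cdn.example.org"),
    ("github.com", "google-analytics.com"),
    ("github.com", "assets.example.org"),
]
known_trackers = {"google-analytics.com", "b.scorecardresearch.com"}

def tracker_request_share(reqs, trackers):
    """Fraction of all requests whose host is a known tracker."""
    hits = sum(1 for _referer, host in reqs if host in trackers)
    return hits / len(reqs)

def trackers_per_website(reqs, trackers):
    """Number of distinct trackers embedded by each referer (website)."""
    per_site = defaultdict(set)
    for referer, host in reqs:
        if host in trackers:
            per_site[referer].add(host)
    return {site: len(hosts) for site, hosts in per_site.items()}

share = tracker_request_share(requests, known_trackers)
per_site = trackers_per_website(requests, known_trackers)
```

The same group-and-count pattern extends to the per-user and per-request variants by changing the grouping key.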
Main Idea
Model the data as a referer → host bipartite graph and exploit the graph structure to identify Trackers
● referers (URLs explicitly visited by the user): facebook.com, youtube.com
● hosts (embedded URLs): google-analytics.com, b.scorecardresearch.com
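A minimal sketch of the graph construction, assuming the input is a list of (referer, host) pairs. Using request counts as edge weights is an assumption of this example (the slides leave the weighting open), and the pairs below are illustrative rather than taken from the dataset.

```python
from collections import defaultdict

def build_bipartite_graph(pairs):
    """Return referer -> {host: request count} adjacency for a referer -> host graph."""
    graph = defaultdict(lambda: defaultdict(int))
    for referer, host in pairs:
        graph[referer][host] += 1  # assumed weighting: number of requests
    return {referer: dict(hosts) for referer, hosts in graph.items()}

# Illustrative pairs only.
pairs = [
    ("facebook.com", "google-analytics.com"),
    ("facebook.com", "b.scorecardresearch.com"),
    ("youtube.com", "google-analytics.com"),
    ("youtube.com", "google-analytics.com"),
]
graph = build_bipartite_graph(pairs)
```

Referers only ever appear as keys and hosts only as values, which is what keeps the graph bipartite.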
Attempt#1
Relevance Search
Iterative, random walk-like
algorithm for bipartite graphs
Given an input source node,
assign a “relevance score” to other
nodes, based on how similar their
network position is
Sun, J., Qu, H., Chakrabarti, D., & Faloutsos, C. (2005). Relevance search and anomaly detection in bipartite graphs. ACM SIGKDD Explorations Newsletter, 7(2), 48-55.
Relevance Search Algorithm
[Diagram: relevance scores with respect to a source node; nodes shown: google-analytics, b.scorecardresearch, xzy/logo_small.jpg; scores shown: 0.9, 0.1]
In each iteration, a vertex:
- sends a score to out-neighbors
- sums up received scores and
updates value
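The iteration described above resembles a random walk with restart, which can be sketched as follows. This is a generic version written for illustration, not the exact algorithm of Sun et al. or the Gelly implementation from the talk; the `restart` probability, the iteration count, the undirected treatment of edges, and the toy `site-*` referers are all assumptions.

```python
def relevance_search(edges, source, restart=0.15, iterations=50):
    """Approximate relevance of every node to `source` with a random walk with restart."""
    # Treat the bipartite graph as undirected so scores can flow back and forth.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    scores = {node: 0.0 for node in neighbors}
    scores[source] = 1.0
    for _ in range(iterations):
        received = {node: 0.0 for node in neighbors}
        for node, score in scores.items():
            share = score / len(neighbors[node])  # send an equal share to each neighbor
            for nb in neighbors[node]:
                received[nb] += share
        # Sum up received scores, then return to the source with probability `restart`.
        scores = {node: (1 - restart) * received[node] for node in neighbors}
        scores[source] += restart
    return scores

# Toy graph: two sites embed both trackers, one site embeds a tracker and an image.
edges = [
    ("site-a", "google-analytics"),
    ("site-a", "b.scorecardresearch"),
    ("site-b", "google-analytics"),
    ("site-b", "b.scorecardresearch"),
    ("site-c", "google-analytics"),
    ("site-c", "xzy/logo_small.jpg"),
]
scores = relevance_search(edges, source="google-analytics")
```

In this toy graph the host that shares more referers with the source ends up with the higher score, which is the behaviour the slide's diagram suggests.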
Relevance Search Implementation
● single-source relevance search
○ similar to PageRank
○ easily mapped to vertex-centric iterations
● multi-source relevance search
○ each vertex keeps a vector of scores
○ compute top-k relevant nodes per source
○ merge the top-k lists
Gelly API
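The last two steps of the multi-source variant can be sketched in plain Python. How the talk's implementation merges the lists is not stated, so this example assumes a simple union in which each node keeps its best score and the sources themselves are excluded; `top_k`, `merge_top_k`, and the score values are hypothetical.

```python
import heapq

def top_k(scores, k, exclude=()):
    """Return the k highest-scoring nodes, skipping any in `exclude`."""
    candidates = {node: s for node, s in scores.items() if node not in exclude}
    return heapq.nlargest(k, candidates, key=candidates.get)

def merge_top_k(scores_per_source, k):
    """Union of the per-source top-k lists; each node keeps its best score."""
    sources = set(scores_per_source)
    merged = {}
    for scores in scores_per_source.values():
        for node in top_k(scores, k, exclude=sources):
            merged[node] = max(merged.get(node, 0.0), scores[node])
    return merged

# Made-up relevance scores for two source trackers.
scores_per_source = {
    "tracker-1": {"tracker-1": 0.5, "host-a": 0.3, "host-b": 0.15, "host-c": 0.05},
    "tracker-2": {"tracker-2": 0.6, "host-b": 0.25, "host-d": 0.1, "host-a": 0.05},
}
candidates = merge_top_k(scores_per_source, k=2)
```

The merged dictionary is the candidate set that a later classification step would label as tracker or non-tracker.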
Data Pipeline
Bipartite graph creation → Multi-source Relevance Search → top-k relevant nodes → Classification
www.google-analytics.com: T
www.bscored-research.com: T
www.facebook.com: NT
www.github.com: NT
cdn.cxense.com: NT
...
Relevance Search Tuning
● How many and which sources to give as input?
● How to define convergence?
● Does initialization matter?
● How to weight the edges of the input graph?
● How to define the relevance score threshold?
Relevance Search Problems
● Easy to find the few very similar and the few very
different pages
● Popular trackers are similar to other popular trackers,
but not to not-so-popular ones
● We might keep re-discovering what we already know
Relevance Search doesn’t seem to completely solve the problem…
Where do we go now?
Attempt#2...N-1
Combining Relevance Search
with other algorithms
Several Clustering algorithms
k-nn Classification
Random Forest
Data Pipeline(s)
Bipartite graph creation → Multi-source Relevance Search → top-k relevant nodes → [feature extraction] → Classification → [evaluation]
Classification: [your clustering, classification, etc. algorithm here]
www.google-analytics.com: T
www.bscored-research.com: T
www.facebook.com: NT
www.github.com: NT
cdn.cxense.com: NT
...
The Projection Graph
[Figure: a referer-hosts graph (referers r1 to r7, hosts h1 to h8) next to its hosts-projection graph, whose edges between hosts are labeled with referers. Legend: referer, non-tracker host (NT), tracker host (T), unlabeled host (?)]
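A hosts-projection can be sketched as follows: two hosts are connected whenever at least one referer embeds both. Counting shared referers as the edge weight is an assumption of this example, and the small referer-to-hosts mapping is a toy input, not the edges of the slide's figure.

```python
from collections import defaultdict
from itertools import combinations

def project_onto_hosts(referer_to_hosts):
    """Connect two hosts if a referer embeds both; weight = number of shared referers."""
    weights = defaultdict(int)
    for hosts in referer_to_hosts.values():
        # Sorting gives each undirected edge one canonical (h1, h2) key.
        for h1, h2 in combinations(sorted(set(hosts)), 2):
            weights[(h1, h2)] += 1
    return dict(weights)

# Toy referer -> hosts mapping.
referer_to_hosts = {
    "r1": ["h1", "h2"],
    "r2": ["h2", "h3"],
    "r3": ["h3", "h4", "h5"],
}
projection = project_onto_hosts(referer_to_hosts)
```

A referer with many embedded hosts contributes a full clique to the projection, so in practice the projected graph can be much denser than the bipartite one.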
Attempt#N
Community Detection on the
Projection Graph
The Projection Graph captures
implicit connections between
trackers, through other sites
Do Trackers form communities in
the Projection Graph?
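The slides do not name the community detection algorithm that was used. As a deliberately simpler stand-in, the sketch below labels an unlabeled host with the majority label among its already-labeled neighbors in the projection graph. The `classify_by_neighbors` helper, the adjacency, and the labels are all made up for illustration; ties are broken arbitrarily.

```python
from collections import Counter

def classify_by_neighbors(adjacency, labels):
    """Label each unlabeled host with the majority label among its labeled neighbors."""
    predicted = {}
    for host, neighbors in adjacency.items():
        if host in labels:
            continue  # already labeled, nothing to predict
        votes = Counter(labels[n] for n in neighbors if n in labels)
        if votes:
            predicted[host] = votes.most_common(1)[0][0]
    return predicted

# Toy projection graph: h5 is unlabeled and connected only to trackers.
adjacency = {
    "h1": {"h2"},
    "h2": {"h1", "h3"},
    "h3": {"h2", "h4", "h5"},
    "h4": {"h3", "h5"},
    "h5": {"h3", "h4"},
}
labels = {"h1": "NT", "h2": "NT", "h3": "T", "h4": "T"}
predicted = classify_by_neighbors(adjacency, labels)
```

If Trackers do cluster together in the projection graph, even a vote this simple should do reasonably well, which is the intuition behind trying community detection.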
Basic Analysis of the Projection Graph
DataSet & Gelly APIs
● Do Trackers form connected components?
● Do Trackers have unusually high degrees?
● Are they mainly connected to other Trackers?
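These three questions map onto standard graph measurements. The plain-Python sketch below is not the DataSet or Gelly code from the talk; the helper names and the toy adjacency and labels are assumptions made for illustration.

```python
def degree(adjacency, host):
    """Number of neighbors of a host in the projection graph."""
    return len(adjacency[host])

def tracker_neighbor_fraction(adjacency, labels, host):
    """Share of a host's labeled neighbors that are trackers (None if none are labeled)."""
    labeled = [labels[n] for n in adjacency[host] if n in labels]
    if not labeled:
        return None
    return sum(1 for label in labeled if label == "T") / len(labeled)

def connected_components(adjacency):
    """Return the connected components as a list of sorted host lists."""
    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:  # iterative depth-first search
            node = stack.pop()
            component.append(node)
            for nb in adjacency[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        components.append(sorted(component))
    return components

# Toy projection graph with one non-tracker and one tracker component.
adjacency = {
    "h1": {"h2"},
    "h2": {"h1"},
    "h3": {"h4", "h5"},
    "h4": {"h3", "h5"},
    "h5": {"h3", "h4"},
}
labels = {"h1": "NT", "h2": "NT", "h3": "T", "h4": "T"}
components = connected_components(adjacency)
h3_fraction = tracker_neighbor_fraction(adjacency, labels, "h3")
```

Comparing the degree and tracker-neighbor fraction of known Trackers against known non-trackers is one way to check whether these quantities separate the two classes.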
Visualization
Final Data Pipeline
raw logs
1: logs pre-processing → cleaned logs
2: bipartite graph creation
3: largest connected component extraction
4: hosts-projection graph creation
5: community detection
6: results
google-analytics.com: T
bscored-research.com: T
facebook.com: NT
github.com: NT
cdn.cxense.com: NT
...
APIs used: DataSet API, Gelly
Very high accuracy and very low FPR :-)
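For reference, accuracy and false positive rate (FPR) can be computed from ground-truth and predicted labels as sketched below, treating trackers ("T") as the positive class. The hosts and labels are invented to exercise the function and are not results from the talk; `accuracy_and_fpr` is a hypothetical helper.

```python
def accuracy_and_fpr(truth, predicted, positive="T"):
    """Accuracy and false positive rate, treating `positive` as the tracker class."""
    tp = fp = tn = fn = 0
    for host, actual in truth.items():
        guess = predicted[host]
        if guess == positive and actual == positive:
            tp += 1
        elif guess == positive:
            fp += 1  # non-tracker wrongly flagged as tracker
        elif actual == positive:
            fn += 1  # tracker that was missed
        else:
            tn += 1
    accuracy = (tp + tn) / len(truth)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr

# Invented labels: h5 is a non-tracker wrongly predicted as a tracker.
truth = {"h1": "T", "h2": "T", "h3": "NT", "h4": "NT", "h5": "NT"}
predicted = {"h1": "T", "h2": "T", "h3": "NT", "h4": "NT", "h5": "T"}
accuracy, fpr = accuracy_and_fpr(truth, predicted)
```

A low FPR matters for blocking in particular, since every false positive is a legitimate host that would be blocked.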
Lessons Learned
● Start simple
● Choose features incrementally
● Visualize your data
● Re-evaluate your models
● Try different data representations
● Use a flexible system
Optimizing the Pipeline
Flink Optimizer
to the rescue :-)
