FUJITSU RESTRICTED - UK & IRELAND EYES ONLY © Copyright 2014 Fujitsu (Ireland) Limited
Traffic Analytics for Linked Data Publishers
Luca Costabello, Pierre-Yves Vandenbussche, Gofran Shukair
Fujitsu Ireland
Corine Deliot, Neil Wilson
British Library
2
The Problem: Measuring Traffic on RDF Datasets
Linked Data publishers have limited awareness of how visitors access their datasets.
No tool exists to mine the access logs of Linked Data servers.
Why is this such a big deal?
Justify investment in Linked Data IT infrastructure
Cost control
Identify abuses
Interpret access peaks
3
Challenges
Which traffic metrics?
• Adapt conventional web analytics metrics
• Define Linked Data-specific extensions
How to extract and compute such metrics?
• Which data sources? (client tracking? server access log mining? both?)
• Need to support the dual data access protocol (HTTP operations + SPARQL)
• How to filter noise? (i.e. robots, search engine crawlers)
• How to detect client sessions? (no client tracking, dual data access protocol)
• How to detect SPARQL activity peaks?
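One possible way to approach the data source and noise questions above is to work from server access logs alone. The sketch below is illustrative only: it assumes the Apache Combined Log Format, a SPARQL endpoint whose path ends in `/sparql` and receives a `query` parameter, and a crude user-agent substring check for robots. None of these details come from the slides, and `classify` and `BOT_HINTS` are hypothetical names.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumption: logs follow the Apache Combined Log Format.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<target>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical, deliberately crude robot hints; a real filter needs a curated list.
BOT_HINTS = ("bot", "crawler", "spider")


def classify(line):
    """Parse one access log line and tag it as HTTP or SPARQL, robot or not."""
    m = LOG_LINE.match(line)
    if m is None:
        return None  # line does not follow the assumed format
    url = urlsplit(m["target"])
    # Assumption: SPARQL queries arrive as GET /sparql?query=...
    # (queries sent in a POST body are not visible in this log format).
    is_sparql = (
        url.path.rstrip("/").endswith("/sparql")
        and "query" in parse_qs(url.query)
    )
    agent = m["agent"].lower()
    return {
        "protocol": "sparql" if is_sparql else "http",
        "robot": any(hint in agent for hint in BOT_HINTS),
        "status": int(m["status"]),
    }
```

Keeping the protocol and robot flags on every record lets later steps (sessions, metrics) treat HTTP lookups and SPARQL queries uniformly while still reporting them separately.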
4
Filling a Gap in Prior Art
Existing tools do not include Linked Data-specific metrics.
Linked Data-specific metrics have been proposed, but without a platform [Moller et al., WebScience 2010]
5
Our Contribution / Agenda
• Traffic analytics platform for LD servers
• Metrics
• Metrics Extraction
• Visitor Sessions
• Heavy/Light SPARQL Queries
• Results & British Library Trial Insight
10
Metrics
[Metric tables, slides 10-12; entries marked * are Linked Data-specific]
14
Visitor Session Detection
Session: a sequence of requests issued with no significant interruptions by a uniquely identified visitor. A session expires after a period of inactivity.
We use the HAC (hierarchical agglomerative clustering) variant by [Murray et al. 2006, Mehrzadi et al. 2012]
• Unsupervised, gap-based session boundary detection
• Comes from traditional web log analysis
• Benefit: visitor-specific temporal cut-off
Two-step procedure:
1. Set the visitor-specific session cut-off to the time interval that significantly increases the variance of that visitor's request intervals.
2. Group HTTP/SPARQL requests into sessions according to that cut-off.
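The two-step procedure can be sketched roughly as follows. This is a simplified illustration of the gap-based idea, not the cited HAC variant itself. The `ratio` threshold that decides what counts as a "significant" variance increase, the `floor` that guards against near-zero variance, and the `min_gaps` warm-up are all assumptions introduced here.

```python
from statistics import pvariance


def session_cutoff(timestamps, ratio=5.0, floor=1.0, min_gaps=3):
    """Step 1: pick a visitor-specific cut-off (seconds), or None if no gap stands out.

    Gaps between consecutive requests are sorted ascending and added one by
    one; the first gap that multiplies the variance by more than `ratio`
    (hypothetical rule) is taken as the session cut-off.
    """
    ts = sorted(timestamps)
    gaps = sorted(b - a for a, b in zip(ts, ts[1:]))
    for i in range(min_gaps, len(gaps)):
        before = pvariance(gaps[:i])
        after = pvariance(gaps[:i + 1])
        # `floor` keeps tiny variances from flagging ordinary gaps.
        if after > ratio * max(before, floor):
            return gaps[i]
    return None


def split_sessions(timestamps, cutoff):
    """Step 2: group one visitor's request times into sessions using the cut-off."""
    ts = sorted(timestamps)
    if not ts:
        return []
    sessions = [[ts[0]]]
    for prev, cur in zip(ts, ts[1:]):
        if cutoff is not None and cur - prev >= cutoff:
            sessions.append([cur])  # gap reached the cut-off: new session
        else:
            sessions[-1].append(cur)
    return sessions
```

Because the cut-off is computed per visitor, a software library issuing requests every few seconds and a human browsing every few minutes each get their own boundary, which is the benefit named above.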
16
Heavy/Light SPARQL Queries Binary Classifier
• Rough estimate of heavy and light queries with supervised binary classification.
• Heavy SPARQL query: a query that requires considerable computational and memory resources.
[Figure labels: Light, Heavy]
17
Heavy/Light SPARQL Queries Binary Classifier
• Feature vectors: SPARQL 1.1 syntactic features only
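To make "syntactic features only" concrete, here is a minimal sketch that turns the text of a query into a feature dictionary without executing it. The slide's actual feature list is not preserved in this transcript, so the keyword set and the `syntactic_features` helper below are assumptions chosen for illustration.

```python
import re

# Hypothetical keyword set; the original slide's feature list is not available.
KEYWORDS = [
    "SELECT", "CONSTRUCT", "ASK", "DESCRIBE", "DISTINCT", "OPTIONAL",
    "UNION", "FILTER", "REGEX", "GROUP BY", "ORDER BY", "LIMIT",
]


def syntactic_features(query):
    """Count SPARQL keywords and distinct variables in the raw query text.

    This is a crude text scan, not a parser: keywords that happen to appear
    inside IRIs or string literals are counted too.
    """
    features = {}
    for kw in KEYWORDS:
        pattern = r"\b" + r"\s+".join(kw.split()) + r"\b"
        features[kw] = len(re.findall(pattern, query, flags=re.IGNORECASE))
    features["num_variables"] = len(set(re.findall(r"[?$]\w+", query)))
    features["length"] = len(query)
    return features
```

A vector built this way is cheap to compute at log-mining time, which fits the goal of a rough estimate rather than a precise cost model.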
19
Datasets
British National Bibliography access logs
• bnb.data.bl.uk (access logs are not public)
• 13 months
• ~10M HTTP requests/month
DBpedia 3.9 access logs
• USEWOD 2015 Dataset
20
Visitor Session Detection: Results
How well do we detect the beginning of a new session?
Dataset
British National Bibliography access logs (3 consecutive days)
~16k HTTP/SPARQL requests
• 32% Desktop browsers (115 visitors)
• 68% Software libraries (10 visitors)
Manually annotated records
• 1=session_start | 0=internal
Baseline: fixed-length cut-offs
HAC outperforms fixed-length cut-offs
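A comparison of this kind can be sketched as below: a fixed-length cut-off labels each request as `1` (session start) or `0` (internal), and the labels are scored against the manual annotations. The slides do not say which cut-off lengths or which measure were used, so the 30-minute default and the precision/recall/F1 scoring are assumptions, and both function names are hypothetical.

```python
def fixed_cutoff_labels(timestamps, cutoff_s=1800):
    """Baseline: label requests 1=session_start, 0=internal with a fixed cut-off.

    `timestamps` are one visitor's request times in seconds, sorted ascending.
    The 1800 s (30 min) default is an assumed, conventional value.
    """
    labels = []
    prev = None
    for t in timestamps:
        labels.append(1 if prev is None or t - prev > cutoff_s else 0)
        prev = t
    return labels


def boundary_scores(y_true, y_pred):
    """Precision, recall and F1 for the session_start class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

The same `boundary_scores` call can score any detector that emits 0/1 labels, so a visitor-specific cut-off and a fixed one are compared on equal terms.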
21
Heavy/Light SPARQL: Experiment Protocol
Random distinct queries from DBpedia 3.9 access logs
Run the queries multiple times on a local clone of DBpedia
Kept ~3.7k queries with low variance (3.1k light, 600 heavy)
Cut-off threshold: 100 ms
Naïve Bayes and SVM
Grid search & randomized search w/ 10-fold CV
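The labeling step of this protocol might look like the sketch below. It assumes the 100 ms cut-off is applied to the mean execution time over repeated runs, and it uses a coefficient-of-variation limit as the "low variance" filter; that limit (`MAX_CV`) and the `label_query` helper are hypothetical, since the slide does not state the exact criterion.

```python
from statistics import mean, pstdev

HEAVY_CUTOFF_MS = 100.0  # cut-off threshold stated on the slide
MAX_CV = 0.2             # hypothetical "low variance" limit (std dev / mean)


def label_query(runtimes_ms):
    """Return 'heavy', 'light', or None if the runs are too unstable to label.

    `runtimes_ms` holds the execution times of one query over several runs.
    """
    avg = mean(runtimes_ms)
    if avg <= 0:
        return None
    if pstdev(runtimes_ms) / avg > MAX_CV:
        return None  # discard queries whose timing varies too much
    return "heavy" if avg > HEAVY_CUTOFF_MS else "light"
```

Queries that survive the filter, paired with their syntactic feature vectors, would then form the training set for the classifiers named above.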
22
Heavy/Light SPARQL: Results
23
Some Insights on BL Traffic Logs
Genuine calls account for 0.6% of total traffic!
HTTP/SPARQL traffic grew 30% over the observed 13 months
Sharp increase in requests from software libraries (95x)
SPARQL accounts for 29% of traffic
Heavy SPARQL queries: 6%
37 days have unusual traffic spikes
Bounce rate: 48%
Software libraries have bigger, deeper, and longer sessions.
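For readers unfamiliar with the bounce rate figure, the sketch below uses the conventional web analytics definition: the share of sessions that contain a single request. The slides do not spell out how the metric was computed for Linked Data traffic, so this definition and the `bounce_rate` helper are assumptions.

```python
def bounce_rate(sessions):
    """Fraction of sessions with exactly one request (assumed definition).

    `sessions` is a list of sessions, each a list of requests.
    """
    if not sessions:
        return 0.0
    bounces = sum(1 for s in sessions if len(s) == 1)
    return bounces / len(sessions)
```

With sessions reconstructed per visitor, the same list can also yield session size and duration, the dimensions along which software libraries are said to differ from browsers.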
24
Summary
We relieve publishers of manual, time-consuming access log mining
Support for Linked Data-specific metrics
• Break down traffic by RDF content
• Capture SPARQL insights
• Properly interpret 303 patterns
Reconstruction of Linked Data visitor sessions
Heavy/light SPARQL classifier w/ SPARQL syntax + supervised learning
Revealed hidden insights in 13 months of British Library access logs
25
Future Work
Statistics on noise (i.e. web crawlers)
Heavy/light classifier
• Feature set refinements
• Does it generalize to other datasets?
Enhance session detection with content-based heuristics
• Relatedness of subsequent SPARQL queries
• Structure and type of requested RDF entities
26
Public Demo: bit.ly/ld-traffic
innovation.ie.fujitsu.com/kedi
