USEWOD2012




History and Background of the
   USEWOD Data Challenge

        Knud Möller, Talis Systems Ltd.
              @knudmoeller


Möller, K., Hausenblas, M., Cyganiak,
R., Grimnes, G., and Handschuh, S.
(2010). Learning from linked open
data usage: Patterns & metrics. In
WebScience 2010, Raleigh, NC, USA.
http://journal.webscience.org/302/

Linked Data


Conventional “Eye-ball” Web   Web of Linked Data

interlinked documents         interlinked items of data
                              (URIs, RDF)

mainly people / Web           mainly machine agents (but
browsers                      also people?)



How is Linked Data being used?
• plenty of research on conventional Web usage
• what about usage of linked data?
Why?
• how healthy is the Web of linked data?
• who is using the data and how? Is it useful? Are there trends?
• providers: improve hosting
• ... just curiosity!
Approach

particular sites:
– a URI for each data item ➙ a request for each data item
  (resource)
– content negotiation best practices
– redirection (HTTP 303)




Example URIs:
– plain resource URI: http://data.semanticweb.org/conference/www/2009
– RDF document URI: http://data.semanticweb.org/conference/www/2009/rdf
– HTML document URI: http://data.semanticweb.org/conference/www/2009/html
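
The relationship between a plain resource URI and its RDF and HTML document URIs can be sketched in a few lines. This is only an illustration, assuming the `/rdf` and `/html` suffix layout shown above; the function name `document_path` and the list of RDF media types are assumptions, and real content negotiation would also weigh q-values in the Accept header.

```python
# Media types treated as "the client wants RDF" (an assumed, non-exhaustive list).
RDF_MEDIA_TYPES = ("application/rdf+xml", "text/turtle", "text/n3")

def document_path(resource_path, accept_header):
    """Return the document path a 303 redirect could point to.

    Simplification: q-values are ignored; any RDF media type in the
    Accept header selects the RDF document, everything else gets HTML.
    """
    accept = accept_header.lower()
    wants_rdf = any(media_type in accept for media_type in RDF_MEDIA_TYPES)
    suffix = "rdf" if wants_rdf else "html"
    return resource_path.rstrip("/") + "/" + suffix

resource = "/conference/www/2009"
print(document_path(resource, "application/rdf+xml"))
print(document_path(resource, "text/html,application/xhtml+xml"))
```

A server following this pattern would answer the request for the plain path with HTTP 303 and a Location header carrying the returned document path.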
Approach (ctd.)

    • server log files
        – common log format (CLF), combined log format

  80.219.211.147 - - [23/May/2009:09:52:03 +0100] "GET /sparql?query=PREFIX [..] LIMIT+200 HTTP/1.0"
       200 64674 "-" "ARC Reader (http://arc.semsol.org/)"

  Labelled fields, in order: request IP, request date, request string,
  response code, response size, referrer, user agent


      • RDF requests vs. “semantic” requests
  90.21.243.141 - - [06/Oct/2008:16:07:58 +0100] "GET /organization/vrije-universiteit-amsterdam-the-netherlands HTTP/1.1" 303 7592 "-" "rdflib-2.4.0 (http://rdflib.net/; eikeon@eikeon.com)"
  90.21.243.141 - - [06/Oct/2008:16:08:02 +0100] "GET /organization/vrije-universiteit-amsterdam-the-netherlands/rdf HTTP/1.1" 200 453586 "-" "rdflib-2.4.0 (http://rdflib.net/; eikeon@eikeon.com)"
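
Log lines like the ones above can be split into their fields with a regular expression. The following is a minimal sketch, not the tooling used for the study; the names `LOG_PATTERN` and `parse_line` are illustrative, and the pattern assumes no escaped double quotes inside the quoted fields.

```python
import re
from datetime import datetime

# One named group per field of the combined log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A size of "-" means no response body was sent.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    entry["date"] = datetime.strptime(entry["date"], "%d/%b/%Y:%H:%M:%S %z")
    return entry

line = ('80.219.211.147 - - [23/May/2009:09:52:03 +0100] '
        '"GET /sparql?query=PREFIX [..] LIMIT+200 HTTP/1.0" '
        '200 64674 "-" "ARC Reader (http://arc.semsol.org/)"')
entry = parse_line(line)
print(entry["ip"], entry["status"], entry["size"], entry["agent"])
```

Returning None for non-matching lines lets a caller count and skip malformed entries instead of aborting a long run over a log file.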
Source Data

Figure 1: The combined log format

Table 1: Overview of four LOD datasets (daily averages in parentheses)

               # triples   # days   total # hits   # plain hits   # RDF hits   # HTML hits       SPARQL
Dog Food          79,175      597      8,427,967      1,923,945      259,031     1,647,205      879,932
                                        (14,117)        (3,223)        (434)       (2,759)      (1,471)
DBpedia      109,750,000      118     87,203,310     22,821,475    7,008,310    22,999,237   20,972,630
                                       (739,011)      (193,402)     (59,392)     (194,909)    (177,734)
DBTune        74,209,000       61      7,467,125      1,952,185    1,135,509       677,904    3,055,493
                                       (122,412)       (32,003)     (18,615)      (11,113)     (50,090)
RKBExplorer   91,501,684       29        529,938            n/a          n/a           n/a        9,327
                                        (18,274)          (n/a)        (n/a)         (n/a)        (322)

[Figure: three pie charts of request types, one each for DBpedia, DBTune and Dog Food. Values per chart: Plain 47.7%, HTML 46.5%, RDF 5.8%, Semantic 2.8%; Plain 45%, HTML 39.9%, RDF 14.9%, Semantic 4.2%; Plain 41.0%, HTML 51.1%, RDF 7.8%, Semantic 2.5%.]

Excerpts from the paper (text cut off at the slide edge is marked […]):
– […] For our evaluation, we had access to log […] two periods: from 24/05/2009–21/06/2009 and from […]/2009–29/10/2009, i.e., roughly two months.
– RKBExplorer [11] is another meta-dataset currently com[…] 44 sub-datasets covering various topics and sources […] the domain of academic research, as well as a Web […]ation that allows users to access and browse its content […]ntegrated fashion. Both RDF and HTML documents […] the resources in all datasets are available. Apart from […]
– […]taining a SPARQL query, we assume that it is […]ble of handling the query result, i.e., either a […] bindings (in the case of a SELECT query), pote[…] containing URIs of RDF resources, or an RDF […] (in the case of a CONSTRUCT or DESCRIBE q[…]
– RDF requests: if an agent directly requests […] from a server, we assume that it knows how t[…]cess data in this format. Directly here mean[…] the agent specified an RDF syntax such as r[…] as an acceptable response in the header of its re[…]
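
One way to separate plain, RDF, HTML and SPARQL hits, and to spot "semantic" requests, is sketched below. It rests on assumptions that are not spelled out on the slides: paths follow the Dog Food layout (`/rdf` and `/html` suffixes, `/sparql` endpoint), and a semantic request is read as a plain-URI request answered with 303 that is followed, within an assumed 60-second window, by a successful request for the matching RDF document from the same IP and agent. The names `request_kind` and `count_semantic` are hypothetical.

```python
from datetime import datetime, timedelta

def request_kind(path):
    """Classify a request path (assumes the Dog Food URI layout)."""
    if path.startswith("/sparql"):
        return "sparql"
    if path.endswith("/rdf"):
        return "rdf"
    if path.endswith("/html"):
        return "html"
    return "plain"

def count_semantic(entries, window=timedelta(seconds=60)):
    """Count plain 303 requests followed by the matching RDF request.

    `entries` must be sorted by date; each is a dict with the keys
    ip, agent, path, status and date. The window length is an assumption.
    """
    pending = {}  # (ip, agent, expected RDF path) -> time of the 303
    semantic = 0
    for entry in entries:
        who = (entry["ip"], entry["agent"])
        kind = request_kind(entry["path"])
        if kind == "plain" and entry["status"] == 303:
            expected = entry["path"].rstrip("/") + "/rdf"
            pending[who + (expected,)] = entry["date"]
        elif kind == "rdf" and entry["status"] == 200:
            redirected_at = pending.pop(who + (entry["path"],), None)
            if redirected_at is not None and entry["date"] - redirected_at <= window:
                semantic += 1
    return semantic

path = "/organization/vrije-universiteit-amsterdam-the-netherlands"
entries = [
    {"ip": "90.21.243.141", "agent": "rdflib-2.4.0", "status": 303,
     "path": path, "date": datetime(2008, 10, 6, 16, 7, 58)},
    {"ip": "90.21.243.141", "agent": "rdflib-2.4.0", "status": 200,
     "path": path + "/rdf", "date": datetime(2008, 10, 6, 16, 8, 2)},
]
print(count_semantic(entries))
```

The two sample entries mirror the rdflib log lines from the "Approach (ctd.)" slide: a 303 on the plain URI, then a 200 on the RDF document four seconds later.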
Agents

[Figure: bar chart of hits per agent for http://data.semanticweb.org, 21/07/2008 - 20/06/2009; x-axis: agents; y-axis: hits (0 to 500,000). Labelled agents include GoogleBot, Yahoo! Slurp, msnbot, SindiceFetcher, multicrawler, rdfbot/1.0 and ARC Reader.]

ordinary traffic: the usual suspects
Agents: How “semantic” are they?

[Figure: bar chart of semantic hits/total hits (scale 0 to 1) for agents with more than 100 semantic hits. Agents shown: attributor/1.13.2, triplr, sindicebot, rdflib-2.4.2, Ripple, OL_Virtuoso_RDF_crawler, Morph_Converter_Service, Falconsbot, Speedy, Slug_SW_Crawler, yacybot, hclsreport-crawler, MJ12bot, PycURL, heritrix/1.14.3, SindiceFetcher, heritrix/pom.version, heritrix/2.0.2, multicrawler, SindiceBot, ia_archiver, Zitgist-APlusPlus-Agent, rdflib-2.4.1, Mp3Bot, curl, Zend_Http_Client, Speedy_Spider, nxcrawler, marbles, -, Java, rdflib-2.4.0, (unknown), ARC_Reader, MLBot, Mozilla, Jakarta_HttpClient, Wget, libwww-perl, MSIE, Firefox, Python-urllib, sindice_ontology_fetcher, Googlebot.]
Demand for LOD?

[Figure: DBpedia Hits over Time (smoothing factor 0.05); series: plain, html, rdf, semantic; x-axis: 2009-06-20 to 2009-11-07; y-axis: 0 to 300,000.]

no increase for semantic requests
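
The chart title mentions a smoothing factor of 0.05 without naming the method. A common choice for such a factor is simple exponential smoothing, sketched here purely as an assumption about what kind of processing could sit behind the curves; `exponential_smoothing` and the sample numbers are illustrative only.

```python
def exponential_smoothing(values, alpha=0.05):
    """Simple exponential smoothing: s[t] = alpha * x[t] + (1 - alpha) * s[t-1].

    The first smoothed value is the first observation. A small alpha such
    as 0.05 gives a slowly reacting curve that suppresses daily spikes.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(float(x))
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

# Made-up daily hit counts with one spike on the third day.
daily_hits = [100, 100, 2000, 100, 100]
print(exponential_smoothing(daily_hits))
```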
Impact of Real-world Events

[Figure: Irish Lisbon Treaty Referendum (smoothing factor 0.05); series: http://dbpedia.org/resource/Republic_of_Ireland, http://dbpedia.org/resource/European_Union, http://dbpedia.org/resource/Treaty_of_Lisbon; x-axis: 2009-06-20 to 2009-11-07; y-axis: 0 to 9; annotation: possible impact.]
Impact of Real-world Events

[Figure: Michael Jackson Memorial Service (smoothing factor 0.05); series: http://dbpedia.org/resource/Staples_Center, http://dbpedia.org/resource/Michael_Jackson_memorial_service, http://dbpedia.org/resource/Michael_Jackson; x-axis: 2009-06-20 to 2009-11-07; y-axis: 0 to 4.5; annotation: possible impact.]
• this research: one motivation for USEWOD
• expand the dataset, encourage more and different analyses
USEWOD Data Challenge 2012

2nd International Workshop on Usage Analysis
             and the Web of Data

       Sponsored by the LATC project
USEWOD Data Challenge


Moving forward by releasing a dataset:
 • to relieve difficulty of obtaining real-life usage
   data
 • to allow comparison and combination of
   approaches done on the same dataset
 • by releasing a relatively new type of logs: usage
   on the Web of Data.
Also for longer-term use.
The USEWOD Dataset 2011


Server logs from two major Web of Data servers:
• DBpedia
 • requests from several weeks within a 2-month period
• Semantic Web Dog Food
 • 2 years of requests, Dec 2008 – Dec 2010
USEWOD 2011 Challenge
                 Participants

• At the time of the workshop 11 groups had
  requested the 2011 data
• By now 28.
• 7 data challenge paper submissions
• Winner of the 2011 USEWOD data challenge prize:
  • Mario Arias Gallego, Javier D. Fernández, Miguel A.
    Martínez-Prieto and Pablo De La Fuente. An Empirical
    Study of Real-World SPARQL Queries.
The USEWOD Dataset 2012


Server logs from two major Web of Data servers:
• DBpedia
 • requests from several weeks within a 2-month period
• Semantic Web Dog Food
 • 2 years of requests, Dec 2008 – Dec 2010
• Linked Open Geo Data
• Bio2RDF
USEWOD 2012 Challenge
             Participants


• 20 groups requested the data, so far.
• 2 data challenge paper submissions…
• 1 winner of the USEWOD data
  challenge prize.
 • kindly brought to you by LATC
DBpedia
Semantic Web Dog Food

[Screenshots and image taken from http://data.semanticweb.org/]
Requests for Human / Machine
        readable Web data


Both servers get requests for RDF
 • http://dbpedia.org/data/Berlin.rdf
as well as HTML
 • http://dbpedia.org/page/Berlin
And requests for the URI itself:
 • http://dbpedia.org/resource/Berlin (will be
   served HTML or RDF)
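
The three DBpedia URI patterns above map directly onto request types in a log. A small sketch, assuming the request path (as it appears in a log's request string) is already available; `dbpedia_request_kind` is a hypothetical helper name.

```python
def dbpedia_request_kind(path):
    """Classify a DBpedia request path by its first segment."""
    if path.startswith("/data/"):
        return "rdf"    # machine-readable document, e.g. /data/Berlin.rdf
    if path.startswith("/page/"):
        return "html"   # human-readable document, e.g. /page/Berlin
    if path.startswith("/resource/"):
        return "plain"  # the resource URI itself; served HTML or RDF
    return "other"

for path in ("/data/Berlin.rdf", "/page/Berlin", "/resource/Berlin"):
    print(path, "->", dbpedia_request_kind(path))
```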
Requests to SPARQL endpoints

• Both servers have a SPARQL endpoint
  to request RDF data:
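  PREFIX dc: <http://purl.org/dc/elements/1.1/>
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  PREFIX swrc: <http://swrc.ontoware.org/ontology#>
  SELECT DISTINCT ?s ?t ?y ?to ?h
  WHERE {
         ?s dc:title ?t .
         ?s swrc:year ?y .
         OPTIONAL { ?s foaf:homepage ?h } .
         OPTIONAL { ?s foaf:topic ?to }
         }
  ORDER BY DESC(?y)
  LIMIT 200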
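
In the server logs, such a query shows up URL-encoded in the request string (compare "GET /sparql?query=PREFIX [..] LIMIT+200" in the log example earlier). The sketch below shows how a client could build that request path with the Python standard library. The shortened query and the `/sparql` path with a `query` parameter are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# A shortened stand-in for the query on the slide.
query = (
    "PREFIX dc: <http://purl.org/dc/elements/1.1/> "
    "SELECT DISTINCT ?s ?t WHERE { ?s dc:title ?t } LIMIT 200"
)

# urlencode turns spaces into "+" and percent-encodes reserved characters.
request_path = "/sparql?" + urlencode({"query": query})
print(request_path)

# Decoding the logged request string recovers the original query text.
decoded = parse_qs(urlsplit(request_path).query)["query"][0]
print(decoded == query)
```

Decoding in the other direction is the first step for any analysis of the SPARQL queries contained in the logs.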
Anonymizing the USEWOD
               Dataset


• IP addresses:
 • replace all IPs with 0.0.0.0
 • add the country code for the original IP
   address -> track location of requests
 • add an identifier of the original IP -> track
   individual requestors
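
The three anonymization steps can be sketched as a single line-rewriting function. This is not the script used to produce the USEWOD dataset: the salted SHA-256 identifier, the position of the two appended fields, and the `country_of` lookup (a stand-in for a GeoIP database) are all assumptions made for illustration.

```python
import hashlib

def anonymize_line(line, salt, country_of):
    """Replace the leading IP with 0.0.0.0 and append country code and IP identifier.

    `country_of` is a caller-supplied function mapping an IP address to a
    country code (hypothetical; a real one would query a GeoIP database).
    """
    ip, separator, rest = line.partition(" ")
    if not separator:
        raise ValueError("expected a log line starting with an IP address")
    # The same IP always yields the same identifier, so individual
    # requestors stay trackable without exposing the address itself.
    ip_id = hashlib.sha256((salt + ip).encode("utf-8")).hexdigest()[:12]
    return f'0.0.0.0 {rest} "{country_of(ip)}" "{ip_id}"'

line = ('80.219.211.147 - - [23/May/2009:09:52:03 +0100] '
        '"GET /sparql?query=PREFIX [..] LIMIT+200 HTTP/1.0" '
        '200 64674 "-" "ARC Reader (http://arc.semsol.org/)"')
anonymized = anonymize_line(line, salt="secret", country_of=lambda ip: "CH")
print(anonymized)
```

The salt would have to stay private: with only about four billion IPv4 addresses, unsalted hashes could be reversed by trying every address.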
USEWOD2011, Hyderabad, India

• M. Kirchberg, R. K. L. Ko, and B. S. Lee. From linked data to relevant data - time is the essence. - http://arxiv.org/abs/1103.5046
• M. A. Gallego, J. D. Fernández, M. A. Martínez-Prieto, and P. D. L. Fuente. An empirical study of real-world SPARQL queries. - http://arxiv.org/abs/1103.5043
USEWOD2012, Lyon, France


• A. Raghuveer. Characterizing Machine Agent Behavior through SPARQL Query Mining. - http://ir.ii.uam.es/usewod2012/usewod2012_raghuveer.pdf
• J. Hoxha, M. Junghans, S. Agarwal. Enabling Semantic Analysis of User Browsing Patterns in the Web of Data. - http://arxiv.org/abs/1204.2713
What could be improved?
• does not work well with embedded metadata (e.g.,
  RDFa-based sites)

• does not take into account usage through meta sites
  (indexes, search engines, mirrors, ...)

• does (probably) not take into account usage through
  apps

• what about caches?

• what about bulk/dump downloads of data?

• not enough usage to be interesting yet?

More Related Content

PPTX
Inference on the Semantic Web
PPTX
Bio2RDF Release 2: Improved coverage, interoperability and provenance of Link...
PDF
Linked Data Technology and Status
PPTX
BioPAX Models and Pathways
PPTX
The Semantic Web #10 - SPARQL
PDF
Query-Load aware partitioning of RDF data
PPTX
The Semantic Web #4 - RDF (1)
PPTX
Linked Data:Libraries and Beyond
Inference on the Semantic Web
Bio2RDF Release 2: Improved coverage, interoperability and provenance of Link...
Linked Data Technology and Status
BioPAX Models and Pathways
The Semantic Web #10 - SPARQL
Query-Load aware partitioning of RDF data
The Semantic Web #4 - RDF (1)
Linked Data:Libraries and Beyond

What's hot (11)

PDF
RDF Tutorial - SPARQL 20091031
PPTX
Contributing to the Smart City Through Linked Library Data
PPTX
Linked Data Modeling for Beginner
PPTX
The Semantic Web #5 - RDF (2)
KEY
Publishing Linked Open Data in 15 minutes
PPTX
Eswc2012 presentation: Supporting Linked Data Production for Cultural Heritag...
PPTX
Querying Linked Data on Android
PDF
Behind the Scenes of KnetMiner: Towards Standardised and Interoperable Knowle...
PDF
Querying Trust in RDF Data with tSPARQL
PDF
Linking UK Government Data, John Sheridan
PPTX
RDF Tutorial - SPARQL 20091031
Contributing to the Smart City Through Linked Library Data
Linked Data Modeling for Beginner
The Semantic Web #5 - RDF (2)
Publishing Linked Open Data in 15 minutes
Eswc2012 presentation: Supporting Linked Data Production for Cultural Heritag...
Querying Linked Data on Android
Behind the Scenes of KnetMiner: Towards Standardised and Interoperable Knowle...
Querying Trust in RDF Data with tSPARQL
Linking UK Government Data, John Sheridan
Ad

Viewers also liked (7)

PDF
Building a Distributed Secure System on Multi-Agent Platform Depending on the...
KEY
The EU Data Cloud - Introduction
PPTX
Practical Applications of Semantic Web in Retail -- Semtech 2014
PDF
Linked GeoRef
PDF
EU Data Cloud - On to the Cloud
PDF
Digitales Graffiti und vernetzte Daten für digitale Städte
PDF
The Semantic Web (and what it can deliver for your business)
Building a Distributed Secure System on Multi-Agent Platform Depending on the...
The EU Data Cloud - Introduction
Practical Applications of Semantic Web in Retail -- Semtech 2014
Linked GeoRef
EU Data Cloud - On to the Cloud
Digitales Graffiti und vernetzte Daten für digitale Städte
The Semantic Web (and what it can deliver for your business)
Ad

Similar to History and Background of the USEWOD Data Challenge (20)

PPT
Semantic Search overview at SSSW 2012
PDF
Publishing Linked Data from RDB
PDF
Webinar: Semantic web for developers
PDF
ISWC GoodRelations Tutorial Part 4
PDF
GoodRelations Tutorial Part 4
PDF
semantic markup using schema.org
PDF
Semantics, Sensors, and the Social Web
PDF
ISWC GoodRelations Tutorial Part 2
PDF
GoodRelations Tutorial Part 2
PDF
Linked Data Basics
PDF
Open hpi semweb-06-part4
PDF
Some news about the SW
PDF
Питер Мика "Making the web searchable"
PDF
PDF
A Provenance-Aware Linked Data Application for Trip Management and Organization
PDF
The Semantic Web: RPI ITWS Capstone (Fall 2012)
PPSX
The Web of data and web data commons
PDF
Semantic web assignment 3
PPTX
SMX Advanced 2012 - Catching up with the Semantic Web
KEY
Introduction to the Semantic Web
Semantic Search overview at SSSW 2012
Publishing Linked Data from RDB
Webinar: Semantic web for developers
ISWC GoodRelations Tutorial Part 4
GoodRelations Tutorial Part 4
semantic markup using schema.org
Semantics, Sensors, and the Social Web
ISWC GoodRelations Tutorial Part 2
GoodRelations Tutorial Part 2
Linked Data Basics
Open hpi semweb-06-part4
Some news about the SW
Питер Мика "Making the web searchable"
A Provenance-Aware Linked Data Application for Trip Management and Organization
The Semantic Web: RPI ITWS Capstone (Fall 2012)
The Web of data and web data commons
Semantic web assignment 3
SMX Advanced 2012 - Catching up with the Semantic Web
Introduction to the Semantic Web

Recently uploaded (20)

PDF
KodekX | Application Modernization Development
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PPTX
Big Data Technologies - Introduction.pptx
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PPTX
sap open course for s4hana steps from ECC to s4
PPTX
MYSQL Presentation for SQL database connectivity
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PPTX
Programs and apps: productivity, graphics, security and other tools
PDF
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
KodekX | Application Modernization Development
Per capita expenditure prediction using model stacking based on satellite ima...
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Advanced methodologies resolving dimensionality complications for autism neur...
Network Security Unit 5.pdf for BCA BBA.
MIND Revenue Release Quarter 2 2025 Press Release
Big Data Technologies - Introduction.pptx
Encapsulation_ Review paper, used for researhc scholars
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
sap open course for s4hana steps from ECC to s4
MYSQL Presentation for SQL database connectivity
Understanding_Digital_Forensics_Presentation.pptx
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Programs and apps: productivity, graphics, security and other tools
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
The Rise and Fall of 3GPP – Time for a Sabbatical?
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
20250228 LYD VKU AI Blended-Learning.pptx

History and Background of the USEWOD Data Challenge

  • 1. USEWOD2012 History and Background of the USEWOD Data Challenge Knud Möller, Talis Systems Ltd. @knudmoeller 1
  • 2. Möller, K., Hausenblas, M., Cyganiak, R., Grimnes, G., and Handschuh, S. (2010). Learning from linked open data usage: Patterns & metrics. In WebScience 2010, Raleigh, NC, USA. http://journal.webscience.org/302/ 2
  • 3. Linked Data Conventional “Eye-ball” Web Web of Linked Data interlinked documents interlinked items of data (URIs, RDF) mainly people / Web mainly machine agents (but browsers also people?) 3
  • 4. Linked Data Conventional “Eye-ball” Web Web of Linked Data interlinked documents interlinked items of data (URIs, RDF) mainly people / Web mainly machine agents (but browsers also people?) 3
  • 5. How is Linked Data being used? • plenty of research on conventional Web usage • what about usage of linked data? Why? • how healthy is the Web of linked data? • who is using the data and how? Is it useful? Are there trends? • providers: improve hosting • ... just curiosity!
  • 7. Approach: particular sites: – a URI for each data item ➙ a request for each data item (resource) – content negotiation best practices – redirection (HTTP 303). Example: the plain resource URI http://data.semanticweb.org/conference/www/2009 redirects to either the RDF document URI http://data.semanticweb.org/conference/www/2009/rdf or the HTML document URI http://data.semanticweb.org/conference/www/2009/html
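The redirection pattern on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Dog Food server's actual implementation: the function name `negotiate`, the list of RDF media types, the simplified Accept-header handling (substring match, no q-values) and the example.org URI are all hypothetical. Only the `/rdf` and `/html` document suffixes and the 303 status come from the slide.

```python
from typing import Tuple

# Media types treated as "the agent wants RDF" (an assumed, non-exhaustive list).
RDF_TYPES = ("application/rdf+xml", "text/turtle", "text/n3")


def negotiate(resource_uri: str, accept_header: str) -> Tuple[int, str]:
    """Return (status, Location) for a request to a plain resource URI.

    The plain URI names the thing itself, so the server never answers 200
    there; it answers 303 and points at a document about the thing.
    """
    accept = accept_header.lower()
    wants_rdf = any(media_type in accept for media_type in RDF_TYPES)
    suffix = "/rdf" if wants_rdf else "/html"
    return 303, resource_uri.rstrip("/") + suffix


# Hypothetical resource URI, shaped like the one on the slide.
resource = "http://example.org/conference/www/2009"
rdf_answer = negotiate(resource, "application/rdf+xml")
html_answer = negotiate(resource, "text/html,application/xhtml+xml")
```

Because each redirect is answered by the server, both the request for the plain URI and the follow-up request for the document show up as separate lines in the access log.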
  • 8. Approach (ctd.)
      • server log files – common log format (CLF), combined log format
        Fields: Request IP, Request Date, Request String, Response Code, Response Size, Referrer, User Agent
        80.219.211.147 - - [23/May/2009:09:52:03 +0100] "GET /sparql?query=PREFIX [..] LIMIT+200 HTTP/1.0" 200 64674 "-" "ARC Reader (http://arc.semsol.org/)"
      • RDF requests vs. “semantic” requests
        90.21.243.141 - - [06/Oct/2008:16:07:58 +0100] "GET /organization/vrije-universiteit-amsterdam-the-netherlands HTTP/1.1" 303 7592 "-" "rdflib-2.4.0 (http://rdflib.net/; eikeon@eikeon.com)"
        90.21.243.141 - - [06/Oct/2008:16:08:02 +0100] "GET /organization/vrije-universiteit-amsterdam-the-netherlands/rdf HTTP/1.1" 200 453586 "-" "rdflib-2.4.0 (http://rdflib.net/; eikeon@eikeon.com)"
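A line in the combined log format can be split into the fields named on this slide with a regular expression. The sketch below is one possible way to do it, not the tooling used for the paper; the pattern, the function name `parse_line` and the field names in the returned dictionary are assumptions made for the example.

```python
import re
from datetime import datetime

# One named group per field of the combined log format.
COMBINED = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Parse one combined-log-format line into a dict, or return None."""
    match = COMBINED.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    record["date"] = datetime.strptime(record["date"], "%d/%b/%Y:%H:%M:%S %z")
    return record


# The SPARQL request shown on the slide.
example = (
    '80.219.211.147 - - [23/May/2009:09:52:03 +0100] '
    '"GET /sparql?query=PREFIX [..] LIMIT+200 HTTP/1.0" 200 64674 '
    '"-" "ARC Reader (http://arc.semsol.org/)"'
)
record = parse_line(example)
```

The pattern assumes that the request string and user agent contain no unescaped double quotes, which holds for the lines on the slide but not necessarily for every real log line.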
  • 9. Source Data. Figure 1: The combined log format. Table 1: Overview of four LOD datasets (values in parentheses appear to be per-day averages).
      | Dataset | # triples | # days | total # hits | # plain hits | # RDF hits | # HTML hits | SPARQL |
      | Dog Food | 79,175 | 597 | 8,427,967 (14,117) | 1,923,945 (3,223) | 259,031 (434) | 1,647,205 (2,759) | 879,932 (1,471) |
      | DBpedia | 109,750,000 | 118 | 87,203,310 (739,011) | 22,821,475 (193,402) | 7,008,310 (59,392) | 22,999,237 (194,909) | 20,972,630 (177,734) |
      | DBTune | 74,209,000 | 61 | 7,467,125 (122,412) | 1,952,185 (32,003) | 1,135,509 (18,615) | 677,904 (11,113) | 3,055,493 (50,090) |
      | RKBExplorer | 91,501,684 | 29 | 529,938 (18,274) | n/a | n/a | n/a | 9,327 (322) |
      Pie charts of request types for DBpedia, DBTune and Dog Food (which chart belongs to which dataset is not recoverable; values are grouped so that each chart sums to roughly 100%): Plain 47.7%, HTML 46.5%, RDF 5.8% (Semantic 2.8%); Plain 45%, HTML 39.9%, RDF 14.9% (Semantic 4.2%); Plain 41.0%, HTML 51.1%, RDF 7.8% (Semantic 2.5%).
      Cropped excerpts from the paper are also visible: the log periods (24/05/2009–21/06/2009 and [..]/2009–29/10/2009, roughly two months), the description of RKBExplorer as a meta-dataset for the domain of academic research, and the definition of RDF requests as requests in which the agent names an RDF syntax as an acceptable response in the header of its request.
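The parenthesized figures in Table 1 can be checked against the totals. The short sketch below assumes they are per-day averages, i.e. total hits divided by the number of days and rounded; the table itself does not state this, and the dictionary layout and function name are made up for the example. The day counts and totals are copied from the slide.

```python
# (number of days, total number of hits) per dataset, as shown in Table 1.
DATASETS = {
    "Dog Food": (597, 8_427_967),
    "DBpedia": (118, 87_203_310),
    "DBTune": (61, 7_467_125),
    "RKBExplorer": (29, 529_938),
}


def hits_per_day(days, total_hits):
    """Average number of hits per logged day, rounded to a whole number."""
    return round(total_hits / days)


averages = {name: hits_per_day(days, total) for name, (days, total) in DATASETS.items()}
# averages == {"Dog Food": 14117, "DBpedia": 739011, "DBTune": 122412, "RKBExplorer": 18274}
```

These values agree with the parenthesized total-hit figures in the table, which supports the per-day reading for that column.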
  • 10. Agents: http://data.semanticweb.org, 21/07/2008 - 20/06/2009. [Bar chart: hits per agent (y-axis: hits, 0 to 500,000; x-axis: agents, 0 to 30). Annotation: ordinary traffic: the usual suspects. Per-agent labels and hit counts are not recoverable.]
  • 11. Agents: How “semantic” are they? [Bar chart: semantic hits/total hits (scale 0 to 1) for agents with >100 semantic hits.] Agents shown: attributor/1.13.2, triplr, sindicebot, rdflib-2.4.2, Ripple, OL_Virtuoso_RDF_crawler, Morph_Converter_Service, Falconsbot, Speedy, Slug_SW_Crawler, yacybot, hclsreport-crawler, MJ12bot, PycURL, heritrix/1.14.3, SindiceFetcher, heritrix/pom.version, heritrix/2.0.2, multicrawler, SindiceBot, ia_archiver, Zitgist-APlusPlus-Agent, rdflib-2.4.1, Mp3Bot, curl, Zend_Http_Client, Speedy_Spider, nxcrawler, marbles, -, Java, rdflib-2.4.0, (unknown), ARC_Reader, MLBot, Mozilla, Jakarta_HttpClient, Wget, libwww-perl, MSIE, Firefox, Python-urllib, sindice_ontology_fetcher, Googlebot
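The ratio plotted on this slide can be computed per agent once each hit has been labelled as semantic or not. The sketch below assumes that labelling has already happened (for instance by spotting a 303 on a plain resource URI followed by a request for the RDF document, as in the rdflib example earlier); the input shape, the function name `semantic_ratios` and the sample agents are hypothetical.

```python
from collections import Counter


def semantic_ratios(hits, min_semantic=100):
    """Map agent -> semantic hits / total hits.

    `hits` is an iterable of (agent, is_semantic) pairs. Only agents with
    more than `min_semantic` semantic hits are kept, as on the slide.
    """
    total = Counter()
    semantic = Counter()
    for agent, is_semantic in hits:
        total[agent] += 1
        if is_semantic:
            semantic[agent] += 1
    return {
        agent: semantic[agent] / total[agent]
        for agent in total
        if semantic[agent] > min_semantic
    }


# Made-up sample: crawlerA has 150 semantic hits out of 200, crawlerB only 10.
sample = (
    [("crawlerA", True)] * 150
    + [("crawlerA", False)] * 50
    + [("crawlerB", True)] * 10
)
ratios = semantic_ratios(sample)
```

With this sample, only crawlerA passes the threshold, with a ratio of 0.75.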
  • 12. Demand for LOD? [Line chart: DBpedia Hits over Time (smoothing factor 0.05); series: plain, html, rdf, semantic; y-axis: hits, 0 to 300,000; x-axis: 2009-06-20 to 2009-11-07.] Annotation: no increase for semantic requests.
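The slide does not say which smoothing method the "smoothing factor 0.05" refers to. One common reading is simple exponential smoothing, where each smoothed value is a weighted mix of the new observation and the previous smoothed value. The sketch below shows that reading only; the function name and the choice to seed the series with its first value are assumptions.

```python
def exp_smooth(values, alpha=0.05):
    """Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_(t-1).

    The first smoothed value is seeded with the first observation.
    """
    smoothed = []
    for x in values:
        previous = smoothed[-1] if smoothed else x
        smoothed.append(alpha * x + (1 - alpha) * previous)
    return smoothed


# A single-day spike of 1000 hits in an otherwise flat series of 100 hits/day.
daily_hits = [100, 100, 1000, 100, 100]
curve = exp_smooth(daily_hits)
```

With a factor as small as 0.05, a one-day spike moves the curve only slightly, so a visible bump in such a chart suggests a change in traffic that lasted several days.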
  • 13. Impact of Real-world Events: Irish Lisbon Treaty Referendum. [Line chart (smoothing factor 0.05) of requests for http://dbpedia.org/resource/Republic_of_Ireland, http://dbpedia.org/resource/European_Union and http://dbpedia.org/resource/Treaty_of_Lisbon; y-axis: 0 to 9; x-axis: 2009-06-20 to 2009-11-07.] Annotation: possible impact.
  • 14. Impact of Real-world Events: Michael Jackson Memorial Service. [Line chart (smoothing factor 0.05) of requests for http://dbpedia.org/resource/Staples_Center, http://dbpedia.org/resource/Michael_Jackson_memorial_service and http://dbpedia.org/resource/Michael_Jackson; y-axis: 0 to 4.5; x-axis: 2009-06-20 to 2009-11-07.] Annotation: possible impact.
  • 15. • this research: one motivation for USEWOD • expand the dataset, encourage more and different analyses
  • 16. USEWOD Data Challenge 2012 2nd International Workshop on Usage Analysis and the Web of Data Sponsored by the LATC project
  • 19. USEWOD Data Challenge Moving forward by releasing a dataset: • to relieve difficulty of obtaining real-life usage data • to allow comparison and combination of approaches done on the same dataset • by releasing a relatively new type of logs: usage on the Web of Data. Also for longer-term use.
  • 20. The USEWOD Dataset 2011: Server logs from two major web of data servers: • DBpedia: several weeks of requests within a 2-month period • Semantic Web Dog Food: 2 years of requests, Dec 2008 – Dec 2010
  • 25. USEWOD 2011 Challenge Participants • At the time of the workshop 11 groups had requested the 2011 data • By now 28. • 7 data challenge paper submissions • Winner of the 2011 USEWOD data challenge prize: • Mario Arias Gallego, Javier D. Fernández, Miguel A. Martínez-Prieto and Pablo De La Fuente. An Empirical Study of Real-World SPARQL Queries.
  • 27. The USEWOD Dataset 2012: Server logs from four major web of data servers: • DBpedia: several weeks of requests within a 2-month period • Semantic Web Dog Food: 2 years of requests, Dec 2008 – Dec 2010 • Linked Open Geo Data • Bio2RDF
  • 28. USEWOD 2012 Challenge Participants • 20 groups requested the data, so far. • 2 data challenge paper submissions… • 1 winner of the USEWOD data challenge prize. • kindly brought to you by LATC
  • 33. Semantic Web Dog Food [Screenshots and image taken from http://data.semanticweb.org/]
  • 36. Requests for Human / Machine readable Web data Both servers get requests for RDF • http://dbpedia.org/data/Berlin.rdf as well as HTML • http://dbpedia.org/page/Berlin And requests for the URI itself: • http://dbpedia.org/resource/Berlin (will be served HTML or RDF)
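The three kinds of request on this slide can be told apart from the request path alone. The sketch below is an illustrative classifier, assuming the DBpedia path prefixes shown here (`/data/`, `/page/`, `/resource/`) and the Dog Food document suffixes (`/rdf`, `/html`) from the earlier slide; the function name and category labels are hypothetical, and a real analysis might also need to inspect the Accept header.

```python
def classify_path(path):
    """Classify a request path as 'sparql', 'rdf', 'html' or 'plain'."""
    if path.startswith("/sparql"):
        return "sparql"
    # RDF documents: DBpedia /data/... or Dog Food .../rdf
    if path.startswith("/data/") or path.endswith("/rdf"):
        return "rdf"
    # HTML documents: DBpedia /page/... or Dog Food .../html
    if path.startswith("/page/") or path.endswith("/html"):
        return "html"
    # Everything else is treated as a request for the resource URI itself.
    return "plain"


labels = [
    classify_path(p)
    for p in ("/data/Berlin.rdf", "/page/Berlin", "/resource/Berlin")
]
```

For the three Berlin examples on the slide this yields "rdf", "html" and "plain" respectively.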
  • 37. Requests to SPARQL endpoints • Both servers have a SPARQL endpoint to request RDF data: SELECT DISTINCT ?s ?t ?y ?to ?h WHERE { ?s dc:title ?t . ?s swrc:year ?y . OPTIONAL {?s foaf:homepage ?h }. OPTIONAL {?s foaf:topic ?t } } ORDER BY DESC(?y) LIMIT 200
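In the logs, a SPARQL query arrives URL-encoded in the `query` parameter of the request string. The sketch below shows one way to recover the query text with Python's standard library; the function name `extract_query`, the assumption that the endpoint path starts with `/sparql`, and the sample request are made up for the example.

```python
from urllib.parse import parse_qs, urlsplit


def extract_query(request_string):
    """Return the decoded SPARQL query from a logged request string, or None.

    `request_string` is the quoted part of a log line,
    e.g. 'GET /sparql?query=... HTTP/1.0'.
    """
    parts = request_string.split(" ")
    if len(parts) < 2:
        return None
    url = urlsplit(parts[1])
    if not url.path.startswith("/sparql"):
        return None
    values = parse_qs(url.query).get("query")
    return values[0] if values else None


# Hypothetical logged request; '+' encodes a space, %3F a '?', %7B/%7D braces.
logged = "GET /sparql?query=SELECT+%3Fs+WHERE+%7B%3Fs+%3Fp+%3Fo%7D+LIMIT+200 HTTP/1.0"
sparql = extract_query(logged)
```

The decoded string can then be handed to a SPARQL parser to study which language elements and triple patterns are used.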
  • 39. Anonymizing the USEWOD Dataset • IP addresses: • replace all IPs with 0.0.0.0 • add the country code for the original IP address -> track location of requests • add an identifier of the original IP -> track individual requestors
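The three anonymization steps can be sketched as a single function over one log line. This is an illustration only, not the procedure actually used for the USEWOD release: the slide does not say how the identifier is derived or where the extra fields are placed, so the salted SHA-256 hash, the position of the two appended fields, the `country_of` lookup callable and all names below are assumptions.

```python
import hashlib


def anonymize(line, country_of, salt):
    """Replace the leading IP with 0.0.0.0 and append country code and IP identifier.

    `country_of` is a caller-supplied function mapping an IP address to a
    country code (for instance backed by a GeoIP database); `salt` keeps the
    identifiers from being reproduced by hashing known IP addresses.
    """
    ip, rest = line.split(" ", 1)
    identifier = hashlib.sha256((salt + ip).encode("utf-8")).hexdigest()[:12]
    return "0.0.0.0 {} {} {}".format(rest, country_of(ip), identifier)


def fake_country_lookup(ip):
    # Stand-in for a real GeoIP lookup; always answers with a placeholder code.
    return "XX"


original = '80.219.211.147 - - [23/May/2009:09:52:03 +0100] "GET / HTTP/1.0" 200 512 "-" "curl"'
anonymized = anonymize(original, fake_country_lookup, salt="example-salt")
```

Because the same IP always maps to the same identifier, requests from one requestor can still be grouped, while the address itself no longer appears in the released line.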
  • 40. USEWOD2011, Hyderabad, India • M. Kirchberg, R. K. L. Ko, and B. S. Lee. From linked data to relevant data - time is the essence. - http://arxiv.org/abs/1103.5046 • M. A. Gallego, J. D. Fernández, M. A. Martínez-Prieto, and P. D. L. Fuente. An empirical study of real-world SPARQL queries. - http://arxiv.org/abs/1103.5043
  • 41. USEWOD2012, Lyon, France • A. Raghuveer. Characterizing Machine Agent Behavior through SPARQL Query Mining. - http://ir.ii.uam.es/usewod2012/usewod2012_raghuveer.pdf • J. Hoxha, M. Junghans, S. Agarwal. Enabling Semantic Analysis of User Browsing Patterns in the Web of Data. - http://arxiv.org/abs/1204.2713
  • 42. What could be improved? • does not work well with embedded metadata (e.g., RDFa-based sites) • does not take into account usage through meta sites (indexes, search engines, mirrors, ...) • does (probably) not take into account usage through apps • what about caches? • what about bulk/dump downloads of data? • not enough usage to be interesting yet?

Editor's Notes

  • #2: not so much about USEWOD in general, but more about the challenge data in particular
  • #6: you have a URI for the thing itself, the subject of a document; you have different URIs for documents about that thing; servers would be set up so that they would give a document back, based on the kind of data that the requesting agent wants; that shows up in the server logs
  • #10: ratio of semantic/total hits
  • #14: challenge dataset grew from the dataset used in this paper; has been constantly growing since then
  • #18: Close to (at the same university as) some of the people behind the DBpedia project. One of the main drivers of this project.
  • #19: Observations about the logs: which are the most used language elements? What are the characteristics of the triple patterns used? Very insightful. Useful for designing query evaluation engines or fine-tuning RDF stores.
  • #24: TODO: ask Knud: is this the same DBpedia data? TODO: ask Knud: what time span is the data from? Open Geo Data: OpenStreetMap as RDF. Bio2RDF: Linked Data for life sciences. Credits go to Knud and Markus. Markus is close to (at the same university as) some of the people behind the DBpedia project. Knud was one of the main drivers of this project.
  • #25: TODO: check the numbers
  • #26: The linked data twin of Wikipedia
  • #29: Screenshot of DBpedia
  • #32: TODO: use a different example.
  • #33: Raw data, but anonymized
  • #34: last year two interesting papers providing analysis of the dataset; note: not all USEWOD papers are about the challenge dataset, just like this year