Searching Linked Data
        From Finding Relevant Sources to Computing Answers
        Invited Presentation @ International Workshop on Scalable Semantic Computing,
        Hangzhou, China, November 2010.

        Thanh Tran, Günter Ladwig, Veli Bicer, Lei Zhang, Daniel Herzig, Yongtao
        Ma, Andreas Wagner, Rudi Studer from AIFB Institute, KIT




    Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu   KIT – University of the State of Baden-Wuerttemberg and
1                                                            National Laboratory of the Helmholtz Association
Agenda

 • Searching Linked Data
      • Opportunities & challenges
 • Keyword Query Routing
      • Problem Definition
      • Summary Models
      • Experiments
 • Linked Data Query Processing
      • Combining Top-down & Bottom-up
      • Stream-based Query Processing
      • Corrective Source Ranking
 • Conclusions



Linked Data




    - 203 linked datasets serve 25 billion RDF triples interconnected by 395 million links
    - As of 09-2010 + other linked data not covered by LOD cloud
Opportunities
         “Articles from awarded researchers at Stanford”

     • Freebase contains data about people
     • DBPedia contains information about awards
     • DBLP contains bibliographic data

     • More complex information needs
     • More precise results
     • More integrated results
Problems
         “Articles from awarded researchers at Stanford”

     • Large number of unknown, unexplored & irrelevant sources!
          • What is in there?
          • What is out there?
          • What is relevant?

     Formulating queries is a hard task! (USABILITY)
          • Which data sources?
          • Which schema elements?

     Processing queries is expensive! (SCALABILITY)
          • Process against all data sources?
          • Explore all links to other sources?

     (z). ∃x, y. prizes(x, Turing Award) ∧ worksAt(x, y) ∧ name(y, Stanford) ∧ publication(x, z)


Searching Linked Data

     • Given the needs (expressed as sets of keywords),
          • are there answers in linked data?
          • what combination of data sources produces them?
          • how to incorporate related unexplored linked sources?

     Keyword Query Routing to Relevant Linked Data Sources
          • Identify valid combinations of sources
          • Identify schema elements

     Focused, Adaptive and Stream-based Linked Data Query Processing
          • Let the user choose a combination of sources
          • Focus on this combination of sources and explore related linked sources (c.f. LARKC)


LOD Data Graph
     • Web data modeled as a set of interlinked data graphs
     • Each data graph represents a source
     • Data graph vs. schema graph vs. source graph

     [Figure: example data graph spanning three sources: Freebase, DBLP and DBPedia.
      Nodes: uni1, per1, per2, per3, per4, pub1, pub2, pub3, prize1, prize2.
      Literal values: name "Stanford University" (uni1); name "John McCarthy" (per2, per3); name "John Mccarthy" (per1);
      name "John Smith" (per4); title "John. …" (pub3); label "Turing Award" (prize1); label "Music Award" (prize2).
      Edge labels: employ, author, prizes, and sameAs links between per2 and per1 and between per1 and per3.]


LOD Schema Graph
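     • Web data modeled as a set of interlinked data graphs
     • Each data graph represents a source
     • Data graph vs. schema graph vs. source graph

     [Figure: schema graph for the three sources.
      Classes: University, Person, Written Work (Freebase); Article, Author (DBLP); Person, Prize (DBPedia).
      Edge labels: employ, author, prizes, and sameAs links between the Person and Author classes of different sources.]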




LOD Source Graph
      • Web data modeled as a set of interlinked data graphs
      • Each data graph represents a source
      • Data graph vs. schema graph vs. source graph

      [Figure: source graph with one node per source (Freebase, DBLP, DBPedia); edge labels: author, sameAs.]




Keyword Query Answers
          • User information need: "stanford article award"

          [Figure: the example data graph (Freebase, DBLP, DBPedia), with pub1 additionally typed as Article,
           searched for data elements matching the keywords "stanford", "article" and "award".]


Problem Definition

      • A keyword query result (also called Steiner graph) is a
        subgraph of the data graph that contains, for every keyword,
        a matching data element (called keyword element), such that
        these elements are pairwise connected over a path.

      • A d-max Steiner graph is a Steiner graph in which every path
        between keyword elements has length d-max or less.

      • Keyword query routing: compute valid sets of data sources,
        called keyword routing plans. A plan is valid if the union of
        its sources produces non-empty keyword query results.
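
The definitions above can be illustrated with a small Python sketch. It is not the authors' implementation: the toy triples, the edge directions, the whitespace-based keyword matching and all function names are assumptions made for illustration. It checks whether a set of sources is a valid plan by testing whether some choice of keyword elements is pairwise connected within d-max hops.

```python
from collections import deque
from itertools import combinations, product

# Hypothetical toy data: (subject, predicate, object, source) tuples loosely
# modeled on the slide example; edge directions are assumptions.
TRIPLES = [
    ("uni1", "name", "Stanford University", "Freebase"),
    ("uni1", "employ", "per2", "Freebase"),
    ("per2", "sameAs", "per1", "Freebase"),
    ("pub1", "type", "Article", "DBLP"),
    ("pub1", "author", "per1", "DBLP"),
    ("per1", "sameAs", "per3", "DBLP"),
    ("per3", "prizes", "prize1", "DBPedia"),
    ("prize1", "label", "Turing Award", "DBPedia"),
]


def build_graph(triples, sources):
    """Undirected adjacency over the triples of the selected sources."""
    adj = {}
    for s, _p, o, src in triples:
        if src in sources:
            adj.setdefault(s, set()).add(o)
            adj.setdefault(o, set()).add(s)
    return adj


def keyword_elements(adj, keyword):
    """Nodes whose whitespace-separated tokens contain the keyword."""
    return [n for n in adj if keyword.lower() in n.lower().split()]


def distances_from(adj, start, d_max):
    """Breadth-first search that stops expanding at depth d_max."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == d_max:
            continue
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def is_valid_plan(triples, sources, keywords, d_max):
    """True if the union of the sources yields a d-max Steiner graph."""
    adj = build_graph(triples, sources)
    candidates = [keyword_elements(adj, k) for k in keywords]
    if any(not c for c in candidates):
        return False
    reach = {n: distances_from(adj, n, d_max) for c in candidates for n in c}
    return any(
        all(b in reach[a] for a, b in combinations(combo, 2))
        for combo in product(*candidates)
    )


QUERY = ["stanford", "article", "award"]
print(is_valid_plan(TRIPLES, {"Freebase", "DBLP", "DBPedia"}, QUERY, 6))
print(is_valid_plan(TRIPLES, {"Freebase", "DBLP"}, QUERY, 6))
```

In this toy graph literals count as nodes, so the longest shortest path between keyword elements has six hops; the plan with all three sources is valid for d-max = 6, while a plan without DBPedia has no element matching "award".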


A Valid Keyword Routing Plan
          • User information need: "stanford article award"

          [Figure: the example data graph (Freebase, DBLP, DBPedia), with pub1 typed as Article,
           illustrating a valid keyword routing plan for this query.]


Keyword Sets
      • One keyword set for every data source
      • Elements stand for distinct keywords mentioned in a source

      [Figure: the example data graph with literal values split into single keywords
       (e.g. "Stanford", "University", "John", "McCarthy", "Smith", "Music", "Turing", "Award"),
       collected into one keyword set per source (Freebase, DBLP, DBPedia).]
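
A keyword set (KS) summary can be sketched in a few lines of Python. This is only an illustration under assumptions: the label list, the lowercase whitespace tokenization and the function names are hypothetical and not taken from the slides.

```python
# Hypothetical literal values per data element and source, loosely following
# the slide example: (element, literal text, source).
LABELS = [
    ("uni1", "Stanford University", "Freebase"),
    ("per2", "John McCarthy", "Freebase"),
    ("per1", "John Mccarthy", "DBLP"),
    ("per3", "John McCarthy", "DBPedia"),
    ("per4", "John Smith", "DBPedia"),
    ("prize1", "Turing Award", "DBPedia"),
    ("prize2", "Music Award", "DBPedia"),
]


def build_keyword_sets(labels):
    """One set of distinct (lowercased) keywords per source."""
    sets = {}
    for _element, text, source in labels:
        sets.setdefault(source, set()).update(text.lower().split())
    return sets


def candidate_sources(keyword_sets, keyword):
    """Sources whose keyword set mentions the keyword."""
    return sorted(s for s, kws in keyword_sets.items() if keyword.lower() in kws)


KEYWORD_SETS = build_keyword_sets(LABELS)
print(candidate_sources(KEYWORD_SETS, "award"))
print(candidate_sources(KEYWORD_SETS, "john"))
```

A keyword set records only which source mentions which keyword. It says nothing about whether the mentioning elements are connected, which is the information the KERG models add.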

Element-level Keyword-Element Relationship Graph (E-KERG)
           • A keyword-element captures a keyword k and the data element mentioning k
           • A relationship between two keyword-elements exists iff there is a path between
             their associated data elements
           • In a d-max KERG, only paths of length d-max or less are considered

           [Figure: E-KERG for the example data graph. Each node pairs a keyword with the data element
            that mentions it, e.g. "Stanford" with uni1, "John" with per2, "Award" with prize1.]
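
The E-KERG definition can be made concrete with a minimal Python sketch. It assumes an undirected element graph and a hypothetical mapping from elements to the keywords they mention; the data and the function names are illustrative, not the authors' code.

```python
from collections import deque
from itertools import combinations

# Hypothetical element-level graph (undirected edges) and keyword mentions.
EDGES = [
    ("uni1", "per2"), ("per2", "per1"), ("per1", "pub1"),
    ("per1", "per3"), ("per3", "prize1"),
]
MENTIONS = {
    "uni1": {"stanford", "university"},
    "pub1": {"article"},
    "prize1": {"turing", "award"},
}


def within(adj, start, d_max):
    """Elements reachable from start over a path of length d_max or less."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == d_max:
            continue
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return set(dist)


def build_e_kerg(edges, mentions, d_max):
    """Return (keyword-elements, relationships) of a d-max E-KERG."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    nodes = {(k, e) for e, kws in mentions.items() for k in kws}
    reach = {e: within(adj, e, d_max) for e in mentions}
    # Two keyword-elements are related if their data elements are close enough;
    # keyword-elements on the same data element are related via a path of length 0.
    rels = {
        frozenset((a, b))
        for a, b in combinations(sorted(nodes), 2)
        if b[1] in reach[a[1]]
    }
    return nodes, rels


nodes3, rels3 = build_e_kerg(EDGES, MENTIONS, 3)
print(len(nodes3), len(rels3))
```

In this toy graph uni1 and pub1 are three hops apart and uni1 and prize1 four hops apart, so the relationship between "stanford" and "award" only appears once d-max is raised to 4.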
Schema-level Keyword-Element Relationship Graph (S-KERG)
           • A keyword-element captures a keyword k and the schema element that contains
             some instances (data elements) mentioning k
           • A relationship between two keyword-elements exists if there is a path between some
             instances of their associated schema elements
           • Keyword-elements are grouped when they capture the same keyword; relationships are
             grouped when they connect the same classes

           [Figure: S-KERG for the example data graph. Data elements are replaced by their classes,
            e.g. uni1 by University, per2 by Person, per1 by Author, prize2 by Prize.]
Data-Source-level Keyword-Element Relationship Graph (D-KERG)
           • A keyword-element captures a keyword k and the source that contains some
             instances (data elements) mentioning k
           • A relationship between two keyword-elements exists if there is a path between
             some instances of their associated sources
           • Keyword-elements are grouped when they capture the same keyword; relationships are
             grouped when they connect the same sources

           [Figure: D-KERG for the example data graph; keyword-elements are attached to the
            sources Freebase, DBLP and DBPedia instead of to individual data elements.]
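
The grouping step that distinguishes S-KERG and D-KERG from E-KERG can be sketched as one generic function. The sketch below is an assumption-laden illustration: the small E-KERG, the class and source assignments (class names qualified by source), and the decision to drop relationships whose two ends collapse into one group are all hypothetical choices, not taken from the slides.

```python
# Hypothetical E-KERG: keyword-elements are (keyword, data element) pairs,
# relationships are unordered pairs of keyword-elements.
E_NODES = {
    ("stanford", "uni1"), ("john", "per2"), ("john", "per1"), ("john", "pub3"),
    ("article", "pub1"), ("award", "prize1"), ("award", "prize2"),
}
E_RELS = {
    frozenset({("stanford", "uni1"), ("john", "per2")}),
    frozenset({("john", "per2"), ("john", "per1")}),
    frozenset({("john", "per1"), ("article", "pub1")}),
    frozenset({("john", "per1"), ("john", "pub3")}),
    frozenset({("john", "per1"), ("award", "prize1")}),
}
# Assumed class and source of every data element.
CLASS_OF = {
    "uni1": "Freebase:University", "per2": "Freebase:Person",
    "per1": "DBLP:Author", "pub1": "DBLP:Article", "pub3": "DBLP:Article",
    "prize1": "DBPedia:Prize", "prize2": "DBPedia:Prize",
}
SOURCE_OF = {e: c.split(":")[0] for e, c in CLASS_OF.items()}


def summarize(nodes, rels, group_of):
    """Replace each data element by its group and merge duplicates."""
    grouped_nodes = {(k, group_of[e]) for k, e in nodes}
    grouped_rels = set()
    for rel in rels:
        pair = frozenset((k, group_of[e]) for k, e in rel)
        if len(pair) == 2:  # simplification: drop relationships that collapse
            grouped_rels.add(pair)
    return grouped_nodes, grouped_rels


s_nodes, s_rels = summarize(E_NODES, E_RELS, CLASS_OF)    # schema level
d_nodes, d_rels = summarize(E_NODES, E_RELS, SOURCE_OF)   # data-source level
print(len(E_NODES), len(s_nodes), len(d_nodes))
```

Each level is a coarser summary of the one before: in this toy input the seven element-level keyword-elements shrink to six at schema level and five at source level.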
Experiments

      • Chunk of the BTC (Billion Triple Challenge) dataset containing 10M RDF
        triples from 154 sources, linked via 500K mappings

      • 30 manually crafted, valid multi-data-source keyword
        queries, i.e., queries that produce non-empty keyword
        answers and involve more than 2 sources, for example:
               • Town River America
               • Beijing Conference Database 2007




Validity
                • P@k measures the percentage of plans that are valid out of the top-k plans
                • P@5 is only 6% for KS, but up to 100% for E-KERG (dmax = 4)
                • More valid plans were computed when a higher value was used for dmax
                • dmax = 3 seems to be a good tradeoff
                • Queries with a larger number of keywords resulted in lower precision


                [Figure: two plots of P@5 (y-axis, 0.0 to 1.0) for E-KERG, D-KERG, S-KERG and KS;
                 left: P@5 over dmax (0 to 4); right: P@5 over the number of keywords |K| (2 to 5).]
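
The P@k measure used on this slide can be written down directly. The following Python sketch assumes that plans are sets of source names and that validity is decided by a caller-supplied function; the ranked list and the set of valid plans are made-up example data, not experimental results.

```python
def precision_at_k(ranked_plans, is_valid, k):
    """Fraction of valid plans among the top-k ranked plans."""
    top = ranked_plans[:k]
    if not top:
        return 0.0
    return sum(1 for plan in top if is_valid(plan)) / len(top)


# Hypothetical ranked plans and ground truth, for illustration only.
RANKED = [
    frozenset({"Freebase", "DBLP", "DBPedia"}),
    frozenset({"Freebase", "DBLP"}),
    frozenset({"DBLP", "DBPedia"}),
    frozenset({"Freebase", "DBPedia"}),
    frozenset({"DBLP"}),
]
VALID = {RANKED[0], RANKED[2], RANKED[3]}

print(precision_at_k(RANKED, VALID.__contains__, 5))
```

Dividing by the number of plans actually returned, rather than by k, is a design choice of this sketch for the case where fewer than k plans exist.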
Performance
                • Times increased with higher values for dmax
                       • Sharp increase for E-KERG and S-KERG
                       • Relatively stable for D-KERG
                • Times increased with the number of keywords
                       • All models except D-KERG performed poorly on complex queries
                       • E-KERG needed more than 100s for queries with more than 2 keywords
                • Time for D-KERG was no more than 10ms on average

                [Figure: two plots of query processing time in ms (logarithmic y-axis, 1 to 1000000)
                 for S-KERG, D-KERG, KS and E-KERG; left: time over dmax (0 to 4); right: time over |K| (2 to 5).]

Mixed Query Processing Strategy

 Combination of top-down and bottom-up
  strategies
            Top-down: partial local index of sources, not assumed to
             be complete
            Bottom-up: new sources are discovered at run-time
 Corrective Source Ranking
            Deal with heterogeneous source descriptions
            Adaptive re-ranking
 Stream-based Query Processing
            Deal with unpredictable nature of Linked Data access
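
The combination of the two strategies can be pictured with a small sketch. It is only an illustration under stated assumptions: the toy web `WEB`, the partial index `LOCAL_INDEX` and the helper `process_mixed` are hypothetical names and data, not taken from the talk, and a dictionary lookup stands in for the actual HTTP retrieval of a source.

```python
from collections import deque

# Hypothetical toy "web": source URI -> (triples, links to other sources).
WEB = {
    "src:a": ([("ex:p1", "worksAt", "dbpedia:KIT")], ["src:b"]),
    "src:b": ([("ex:p1", "knows", "ex:p2")], ["src:c"]),
    "src:c": ([("ex:p2", "name", "Alice")], []),
}

# Partial local index (top-down): predicate -> known sources.
# It is incomplete on purpose: only one source is known up front.
LOCAL_INDEX = {"worksAt": ["src:a"]}


def process_mixed(predicates):
    """Seed the queue from the index, then follow links found at run-time."""
    queue = deque()
    seen = set()
    for p in predicates:                      # top-down: probe the local index
        for src in LOCAL_INDEX.get(p, []):
            if src not in seen:
                seen.add(src)
                queue.append(src)
    triples = []
    while queue:
        src = queue.popleft()
        data, links = WEB[src]                # stands in for an HTTP lookup
        triples.extend(t for t in data if t[1] in predicates)
        for link in links:                    # bottom-up: newly discovered sources
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return triples, seen


triples, visited = process_mixed({"worksAt", "knows", "name"})
```

Here only `src:a` is known from the index; `src:b` and `src:c` are reached solely through links discovered while processing, which is the bottom-up part of the strategy.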


Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu   KIT – University of the State of Baden-Wuerttemberg and
ISWC 2010, Shanghai, China                               National Laboratory of the Helmholtz Association
Stream-based Query Processing

     Compile-time
          Construct query plan
          Probe local index for sources
     Network latency
          Do not block!
          Evaluation driven by incoming data
     Run-time
          Retrieve sources
          Push data into query plan
          Discover new sources
          Rank sources

     [Figure: architecture. Query plan: Join(Join(worksAt(?x, dbpedia:KIT), knows(?x, ?y)), name(?y, ?n)), producing Results. Source Retrieval (Source Retriever 1, Source Retriever 2, ...) fetches sources from Linked Data. Source Ranker holds a ranked list (Source 1, score: 1.0; Source 2, score: 0.7; ...) and draws on the local source index. Arrows are labeled Push, Retrieve source, Source discovered and Samples.]
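
The "do not block" idea can be illustrated with a push-based join. The slide does not name a concrete operator, so the sketch below substitutes a symmetric hash join, a standard non-blocking join; the class `PushJoin`, its `push` method and the example bindings are hypothetical and only meant to show how evaluation can be driven by incoming data.

```python
from collections import defaultdict


class PushJoin:
    """Non-blocking (symmetric hash) join: each input is hashed as it arrives
    and immediately probed against the other side, so results are emitted
    as soon as both matching bindings are available."""

    def __init__(self, var, emit):
        self.var = var                               # shared join variable, e.g. "?x"
        self.emit = emit                             # callback receiving joined bindings
        self.tables = (defaultdict(list), defaultdict(list))

    def push(self, side, binding):
        key = binding[self.var]
        self.tables[side][key].append(binding)       # remember for later arrivals
        for other in self.tables[1 - side][key]:     # probe the opposite input
            self.emit({**other, **binding})


results = []
join = PushJoin("?x", results.append)

# Bindings arrive in whatever order the sources respond.
join.push(1, {"?x": "ex:p1", "?y": "ex:p2"})   # from knows(?x, ?y)
join.push(0, {"?x": "ex:p1"})                  # from worksAt(?x, dbpedia:KIT)
join.push(0, {"?x": "ex:p9"})                  # no partner yet, nothing emitted
```

Neither input has to be complete before the other is read, so a slow source delays only the results that depend on it.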

Corrective Source Ranking

 Prefer more relevant sources
 Relevance of a source depends on
            Current query
            Any available intermediate results
            Overall optimization goal
 Define a set of source features and derive concrete
  source metrics
            Not all metrics are available for all sources (heterogeneity)
 Refine previously computed metrics using newly
  discovered information (intermediate results, samples)
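
A rough sketch of how such a ranker could cope with missing metrics and refine scores at run-time. The metric names, the weights in `WEIGHTS`, the `score` function and the `SourceRanker` class are all assumptions made for illustration; the talk does not specify the concrete features or how they are combined.

```python
import heapq

# Hypothetical metrics in [0, 1]; a source may lack some of them (heterogeneity).
WEIGHTS = {"triple_match": 0.5, "link_count": 0.3, "sample_hits": 0.2}


def score(metrics):
    """Weighted average over the metrics that are actually available."""
    known = {m: v for m, v in metrics.items() if m in WEIGHTS}
    if not known:
        return 0.0
    total = sum(WEIGHTS[m] for m in known)
    return sum(WEIGHTS[m] * v for m, v in known.items()) / total


class SourceRanker:
    def __init__(self):
        self.metrics = {}     # source -> metrics seen so far
        self.heap = []        # (-score, source); may hold stale entries

    def update(self, source, **new_metrics):
        """Add or refine metrics and re-rank the source."""
        self.metrics.setdefault(source, {}).update(new_metrics)
        heapq.heappush(self.heap, (-score(self.metrics[source]), source))

    def pop_best(self):
        """Return the best unretrieved source, skipping stale heap entries."""
        while self.heap:
            neg, source = heapq.heappop(self.heap)
            if source in self.metrics and -neg == score(self.metrics[source]):
                del self.metrics[source]
                return source
        return None


ranker = SourceRanker()
ranker.update("src:a", triple_match=0.9)   # known from the local index
ranker.update("src:b", link_count=0.4)     # discovered via a link
ranker.update("src:b", sample_hits=1.0)    # refined by a sample
order = [ranker.pop_best(), ranker.pop_best(), ranker.pop_best()]
```

Normalizing by the weights of the available metrics keeps sources with sparse descriptions comparable to well-described ones, and pushing a fresh heap entry on every update is one simple way to realize re-ranking.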

Evaluation

 Three systems: top-down (TD), bottom-up (BU), mixed (MI)
 8 queries over various datasets (DBpedia, Geonames, NYT)
 To make the approaches comparable, sources were restricted
  to those discoverable by the BU approach
 ~6200 sources, containing ~500k triples
            Sources hosted on local proxy server with artificial delay of 2 seconds
            25% of sources were randomly chosen to construct index for MI




Results
    Overall early result reporting
        25% results: MI 8.7s, BU 15.1s
        50% results: MI 12.8s, BU 22.0s
        Improvement of ~42%
    Detailed results for two queries:

                         Query 1                        Query 6
                    BU        MI        TD         BU        MI        TD
25% Results      24810.5   10300.0   11038.0     8222.5    4743.5    5545.0
50% Results      43464.5   40782.0   15787.0    10961.5    7650.5    5634.0
Total            84066.5   86895.5   44323.5    24086.0   20711.0   16469.0
Src. Selection       0.0     853.0    1444.5        0.0    1331.0    1863.5
Ranking             25.5    2404.0     411.5       23.5     292.5     335.0
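
The ~42% improvement is consistent with the reported average times. The check below assumes the figure is the relative reduction of MI over BU, averaged over the 25% and 50% marks; that reading is an assumption, and the variable names are illustrative.

```python
# Average times (seconds) until 25% and 50% of the results were reported,
# as given on the slide.
reported = {"25%": {"MI": 8.7, "BU": 15.1}, "50%": {"MI": 12.8, "BU": 22.0}}

# Relative reduction of MI compared with BU at each mark.
improvement = {
    level: (t["BU"] - t["MI"]) / t["BU"] for level, t in reported.items()
}
mean_improvement = sum(improvement.values()) / len(improvement)
```

Both marks come out at roughly 42%, matching the number on the slide.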


Result Arrival Times




Conclusions
      Keyword query routing
         Helps users without knowledge of Linked Data and schemas to
           find combinations of sources that contain answers
           corresponding to their needs
         Focuses on relevant combinations
      Summarizing at the level of sources (D-KERG) is the most
       practical trade-off: it produces results in less than 10ms on
       average, and every second result was valid
      Stream-based query processing helps to deal with the
       unpredictable nature of Linked Data
      A corrective, mixed strategy that incorporates new sources and
       knowledge at run-time for optimization (source ranking) helped to
       report early results ~42% faster on average




Thanks for Your Attention!

                                                                  Institute AIFB, KIT

                                                               ducthanh.tran@kit.edu





Searching Linked Data

  • 1. Searching Linked Data From Finding Relevant Sources to Computing Answers Invited Presentation @ International Workshop on Scalable Semantic Computing, Hangzhou, China, November 2010. Thanh Tran, Günter Ladwig, Veli Bicer, Lei Zhang, Daniel Herzig, Yongtao Ma, Andreas Wagner, Rudi Studer from AIFB Institute, KIT Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 1 National Laboratory of the Helmholtz Association
  • 2. Agenda  Searching Linked Data  Opportunities & challenges  Keyword Query Routing  Problem Definition  Summary Models  Experiments  Linked Data Query Processing  Combining Top-down & Bottom-up  Stream-based Query Processing  Corrective Source Ranking  Conclusions Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 2 National Laboratory of the Helmholtz Association
  • 3. Linked Data - 203 linked datasets serve 25 billion RDF triples interconnected by 395 million links - As of 09-2010 + other linked data not covered by LOD cloud Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 3 National Laboratory of the Helmholtz Association
  • 4. Opportunities “Articles from awarded researchers at Stanford ”  Freebase contains data about people  More complex information needs  DBPedia contains information about awards  More precise results  DBLP contains bibliographic data  More integrated results Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 4 National Laboratory of the Helmholtz Association
  • 5. Problems “Articles from awarded researchers at Stanford ”  Large number of unknown, unexplored & irrelevant sources!  What is in there?  What is out there?  What is relevant? Formulating queries is a hard task! Processing queries is expensive! • Which data sources? USABILITY • Process against all data sources? SCALABILITY • Which schema elements? • Explore all links to other sources? ( z). x, y.prizes(x, Turing Award) worksAt(x,y) name(y,Stanford) publication(x, z) Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 5 National Laboratory of the Helmholtz Association
  • 6. Searching Linked Data  Given the needs (expressed as sets of keywords),  are there answers in linked data?  what combination of data sources produce them?  how to incorporate related unexplored linked sources?  Keyword Query Routing to of Identify valid combination  Let user choose combination sources Linked Data Sources Relevant of sources  Identify schema elements  Focused,on this combination of Focus Adaptive and Stream- sources and explore related based Linked Data Query linked sources(c.f. LARKC) Processing Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 6 National Laboratory of the Helmholtz Association
  • 7. Agenda  Searching Linked Data  Opportunities & challenges  Keyword Query Routing  Problem Definition  Summary Models  Experiments  Linked Data Query Processing  Combining top-down & bottom-up  Stream-based query processing  Corrective source ranking  Conclusions Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 7 National Laboratory of the Helmholtz Association
  • 8. LOD Data Graph  Web data modeled as a set of interlinked data graphs  Each data graph represent a source  Data graph vs. schema graph vs. source graph Freebase DBLP DBPedia … John Music John. Smith Award title name label uni1 pub2 pub1 pub3 per4 prize2 author prizes employ author author per2 per1 per3 prize1 sameAs sameAs prizes name name name name label Stanford John John John Turing University McCarthy Mccarthy McCarthy Award Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 8 National Laboratory of the Helmholtz Association
  • 9. LOD Schema Graph  Web data modeled as a set of interlinked data graphs  Each data graph represent a source  Data graph vs. schema graph vs. source graph Freebase DBLP DBPedia Written University Article Work employ author author Person Author Person Prize sameAs sameAs prizes Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 9 National Laboratory of the Helmholtz Association
  • 10. LOD Source Graph  Web data modeled as a set of interlinked data graphs  Each data graph represent a source  Data graph vs. schema graph vs. source graph Freebase DBLP DBPedia author sames sameAs Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 10 National Laboratory of the Helmholtz Association
  • 11. Keyword Query Answers User information need „stanford article award“ Freebase DBLP DBPedia … John Music Article John. Smith Award type title name label uni1 pub2 pub1 pub3 per4 prize2 author prizes employ author author per2 per1 per3 prize1 sameAs sameAs prizes name name name name label Stanford John John John Turing University McCarthy Mccarthy McCarthy Award Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 11 National Laboratory of the Helmholtz Association
  • 12. Problem Definition  Keyword query result (also called Steiner graph) is a subgraph of data graph that for every keyword, contains a matching data element (called keyword elements), and these elements are pairwise connected over a path.  d-max Steiner graph is a Steiner graph where paths between keyword elements is d-max or less.  Keyword query routing: compute valid set of data sources called keyword routing plan. A plan is valid if its union set of sources produces non-empty keyword query results. Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 12 National Laboratory of the Helmholtz Association
  • 13. A Valid Keyword Routing Plan User information need „stanford article award“ Freebase DBLP DBPedia … John Music Article John. Smith Award type title name label uni1 pub2 pub1 pub3 per4 prize2 author prizes employ author author per2 per1 per3 prize1 sameAs sameAs prizes name name name name label Stanford John John John Turing University McCarthy Mccarthy McCarthy Award Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 13 National Laboratory of the Helmholtz Association
  • 14. Agenda  Searching Linked Data  Opportunities & challenges  Keyword Query Routing  Problem Definition  Summary Models  Experiments  Linked Data Query Processing  Combining Top-down & Bottom-up  Stream-based Query Processing  Corrective Source Ranking  Conclusions Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 14 National Laboratory of the Helmholtz Association
  • 15. Keyword Sets  One keyword set for every data source  Elements stand for distinct keywords mentioned in a source Freebase DBLP DBPedia … John Music Smith Music John. Smith Award title name label uni1 pub2 pub1 pub3 per4 prize2 author prizes author author per2 per1 per3 prize1 sameAs sameAs prizes employ Stanford John McCarthy John Award name name name label Stanford John John John Turing University McCarthy John McCarthy Turing University McCarthy Mccarthy McCarthy Award Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 16 National Laboratory of the Helmholtz Association
  • 16. Element-level Keyword-Element Relationship Graph (E- KERG)  A keyword-element captures a keyword k and the data element mentioning k  A relationship between two keyword-elements exists iff there is a path between their associated data elements  In d-max KERG, the paths to be considered have length d-max or less Freebase DBLP DBPedia pub4 per4 prize2 … John Music John Smith Music John. Smith Award title name label uni1 pub2 pub1 pub3 John per4 Award prize2 author prizes author author per2 per1 per3 prize1 sameAs sameAs prizes employ uni1 per2 per1 per3 prize1 Stanford John McCarthy John Award name name name label Stanford John John John Turing University McCarthy John McCarthy Turin University McCarthy Mccarthy McCarthy Award Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 17 National Laboratory of the Helmholtz Association
  • 17. Schema-level Keyword-Element Relationship Graph (S-KERG)  A keyword-element captures a keyword k and the schema element which contains some instances (date elements) mentioning k  A relationship between two keyword-elements exists if there is a path between some instances of their associated schema elements  Groups ele. (rel.) when they capture same keyword (rel. between same classes) Freebase DBLP DBPedia Article pub4 Person per4 Prize prize2 … John Music John Smith Music John. Smith Award title name label uni1 pub2 pub1 pub3 John per4 Award prize2 author prizes author author per2 per1 per3 prize1 sameAs sameAs prizes employ University uni1 Person per2 Author per1 per3 prize1 Stanford John McCarthy John Award name name name label Stanford John John John Turing University McCarthy John McCarthy Turin University McCarthy Mccarthy McCarthy Award Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 18 National Laboratory of the Helmholtz Association
  • 18. Data-Source-level Keyword-Element Relationship Graph (D-KERG)  A keyword-element captures a keyword k and the source which contains some instances (date elements) mentioning k  A relationship between two keyword-elements exists if there is a path between some instances of their associated sources  Groups ele. (rel.) when they capture same keyword (rel. between same sources) Freebase DBLP DBPedia Article pub4 Person per4 Prize prize2 … John Music John Smith Music John. Smith Award title name label uni1 pub2 pub1 pub3 John per4 Award prize2 author prizes author author per2 per1 per3 prize1 sameAs sameAs prizes employ University uni1 Person per2 Author per1 per3 prize1 Stanford John McCarthy John Award name name name label Stanford John John John Turing University McCarthy John McCarthy Turin University McCarthy Mccarthy McCarthy Award Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 19 National Laboratory of the Helmholtz Association
  • 19. Agenda  Searching Linked Data  Opportunities & challenges  Keyword Query Routing  Problem Definition  Summary Models  Experiments  Linked Data Query Processing  Combining Top-down & Bottom-up  Stream-based Query Processing  Corrective Source Ranking  Conclusions Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 21 National Laboratory of the Helmholtz Association
  • 20. Experiments  Chunk of the BTC dataset containing 10M RDF triples from 154 sources, linked via 500K mappings  Manually crafted 30 keyword valid multi-data- source queries, i.e., produce non-empty keyword answers and involve more than 2 sources  Town River America  Beijing Conference Database 2007 Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 22 National Laboratory of the Helmholtz Association
  • 21. Validity  P@k measure the percentage of plans that are valid out of the top-k plans  P@5 for KS only 6%, P@5 up to 100% for E-KERG (dmax =4)  More valid plans were computed when a higher value was used for dmax  dmax =3 seems to be a good tradeoff  Queries with larger number of keywords resulted in lower precision 1.0 1.0 E-KERG D-KERG E-KERG 0.9 0.9 D-KERG S-KERG KS 0.8 0.8 0.7 S-KERG 0.7 0.6 KS 0.6 P@5 P@5 0.5 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0.0 0.0 0 1 2 3 4 2 3 4 5 dmax |K| Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 23 National Laboratory of the Helmholtz Association
  • 22. Performance  Times increased with higher values for dmax  Sharp for E-KERG and S-KERG  Relatively stable for D-KERG  Times increase with number of keywords  All other models had poor performance w.r.t complex queries but D-KERG  E-KERG needed more than 100s for queries with more than 2 keywords  Time for D-KERG was no more than 10ms on average S-KERG D-KERG KS E-KERG S-KERG D-KERG KS E-KERG 1000000 1000000 Query Processing Time (ms) Query Processing Time (ms) 100000 100000 10000 10000 1000 1000 100 100 10 10 1 1 0 1 2 3 4 2 3 4 5 dmax |K| Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 24 National Laboratory of the Helmholtz Association
  • 24. Mixed Query Processing Strategy
      - Combination of top-down and bottom-up strategies
          - Top-down: partial local index of sources, not assumed to be complete
          - Bottom-up: new sources are discovered at run-time
      - Corrective Source Ranking
          - Deals with heterogeneous source descriptions
          - Adaptive re-ranking
      - Stream-based Query Processing
          - Deals with the unpredictable nature of Linked Data access
      ISWC 2010, Shanghai, China
  • 26. Stream-based Query Processing
      - Compile-time
          - Construct query plan
          - Probe local index for sources
      - Network latency
          - Do not block!
          - Evaluation driven by incoming data
      - Run-time
          - Retrieve sources
          - Push data into query plan
          - Discover new sources
          - Rank sources
      - [Diagram: a query plan with two joins over worksAt(?x, dbpedia:KIT), knows(?x, ?y) and name(?y, ?n) that produces results; Source Retrievers 1 and 2 push retrieved data into the plan; a Source Ranker orders sources (Source 1, score 1.0; Source 2, score 0.7) taken from Linked Data and the local source index, and receives discovered sources and samples]
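The "do not block" idea can be illustrated with a push-based join that emits results as soon as matching bindings have arrived from either input. The sketch below uses a symmetric hash join, one common non-blocking operator; the slides do not say which operator the system uses, so treat the class name, the `push` method and the `ex:` identifiers as hypothetical.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Non-blocking join: every pushed binding is stored and probed at once."""

    def __init__(self, join_var, emit):
        self.join_var = join_var  # variable shared by both inputs, e.g. "x"
        self.emit = emit          # callback that receives each joined binding
        self.tables = {"left": defaultdict(list), "right": defaultdict(list)}

    def push(self, side, binding):
        other = "right" if side == "left" else "left"
        key = binding[self.join_var]
        # Remember the binding so that later arrivals on the other side match it.
        self.tables[side][key].append(binding)
        # Probe the opposite table and emit every match immediately.
        for match in self.tables[other][key]:
            self.emit({**match, **binding})


results = []
join = SymmetricHashJoin("x", results.append)

# Bindings arrive in whatever order the sources respond.
join.push("left", {"x": "ex:alice"})                   # worksAt(?x, dbpedia:KIT)
join.push("right", {"x": "ex:bob", "y": "ex:carol"})   # knows(?x, ?y)
join.push("right", {"x": "ex:alice", "y": "ex:dave"})  # first result is emitted here
join.push("left", {"x": "ex:bob"})                     # second result is emitted here
```

Neither input has to be complete before results appear, so a slow source delays only the results that depend on it.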
  • 28. Corrective Source Ranking
      - Prefer more relevant sources
      - Relevance of a source is based on
          - Current query
          - Any available intermediate results
          - Overall optimization goal
      - Define a set of source features and derive concrete source metrics
      - Not all metrics are available for all sources (heterogeneity)
      - Refine previously computed metrics using newly discovered information (intermediate results, samples)
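A minimal sketch of corrective ranking under stated assumptions: each source has a partial set of metrics normalized to [0, 1], the score is a weighted average over whichever metrics are known, and a refined metric simply re-queues the source with its new score. The metric names (`index_hits`, `inlinks`), the weights and the scoring formula are hypothetical and are not the metrics defined by the authors.

```python
import heapq


class SourceRanker:
    """Priority queue of sources whose scores can be corrected at run-time."""

    def __init__(self, weights):
        self.weights = weights  # metric name -> weight
        self.metrics = {}       # source -> {metric name: value in [0, 1]}
        self.heap = []          # (-score, source) entries; some may be stale
        self.done = set()       # sources already handed out for retrieval

    def score(self, source):
        known = {m: v for m, v in self.metrics[source].items() if m in self.weights}
        if not known:
            return 0.0
        # Heterogeneity: average only over the metrics this source provides.
        total_weight = sum(self.weights[m] for m in known)
        return sum(self.weights[m] * v for m, v in known.items()) / total_weight

    def update(self, source, **new_metrics):
        """Add a source or refine its metrics, then re-rank it."""
        if source in self.done:
            return
        self.metrics.setdefault(source, {}).update(new_metrics)
        heapq.heappush(self.heap, (-self.score(source), source))

    def pop_best(self):
        while self.heap:
            neg_score, source = heapq.heappop(self.heap)
            if source in self.done or -neg_score != self.score(source):
                continue  # skip entries made stale by a later update
            self.done.add(source)
            return source
        return None


ranker = SourceRanker({"index_hits": 0.6, "inlinks": 0.4})
ranker.update("A", index_hits=0.2, inlinks=0.5)  # score 0.32
ranker.update("B", inlinks=0.6)                  # only one metric known: 0.6
ranker.update("C", index_hits=0.5)               # only one metric known: 0.5
ranker.update("A", index_hits=0.9)               # corrected by new information: 0.74

order = [ranker.pop_best() for _ in range(3)]
```

Source A starts out ranked last and moves to the front once the refined metric arrives, which is the corrective behaviour the slide describes.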
  • 29. Evaluation
      - Three systems: top-down (TD), bottom-up (BU), mixed (MI)
      - 8 queries over various datasets (DBpedia, Geonames, NYT)
      - To make the approaches comparable, sources were restricted to those discoverable by the BU approach
          - ~6200 sources, containing ~500k triples
      - Sources hosted on a local proxy server with an artificial delay of 2 seconds
      - 25% of sources were randomly chosen to construct the index for MI
  • 30. Results
      Overall early result reporting:
      - 25% results: MI 8.7s, BU 15.1s
      - 50% results: MI 12.8s, BU 22.0s
      - Improvement of ~42%
      Detailed results for two queries:

                        Query 1                        Query 6
                        BU        MI        TD         BU        MI        TD
      25% Results       24810.5   10300.0   11038.0    8222.5    4743.5    5545.0
      50% Results       43464.5   40782.0   15787.0    10961.5   7650.5    5634.0
      Total             84066.5   86895.5   44323.5    24086.0   20711.0   16469.0
      Src. Selection    0.0       853.0     1444.5     0.0       1331.0    1863.5
      Ranking           25.5      2404.0    411.5      23.5      292.5     335.0
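The ~42% figure is consistent with averaging the relative time savings of MI over BU at the two reporting points. The slides do not state how the number was derived, so the calculation below is one plausible reading rather than the authors' method; the function name is made up for the example.

```python
def relative_improvement(baseline_s, improved_s):
    """Share of the baseline time that was saved."""
    return (baseline_s - improved_s) / baseline_s


at_25 = relative_improvement(15.1, 8.7)   # BU vs. MI at 25% of results
at_50 = relative_improvement(22.0, 12.8)  # BU vs. MI at 50% of results
average = (at_25 + at_50) / 2

print(f"25%: {at_25:.1%}, 50%: {at_50:.1%}, average: {average:.1%}")
```

Both reporting points give a saving of roughly 42%, so the average does not hide a large difference between them.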
  • 31. Result Arrival Times
  • 33. Conclusions
      - Keyword query routing
          - Helps users without knowledge of linked data and schemas to find combinations of sources that contain answers corresponding to their needs
          - Focus on relevant combinations
          - Summarizing at the level of sources (D-KERG) represents the most practical trade-off: it produces results in less than 10ms, of which every second one was valid
      - Stream-based query processing helps to deal with the unpredictable nature of Linked Data
      - A corrective, mixed strategy that incorporates new sources and knowledge at run-time for optimization (source ranking) helped to report early results 42% faster on average
  • 34. Thanks for Your Attention! Institute AIFB, KIT, ducthanh.tran@kit.edu

Editor's Notes

  • #5: More complex information needs; more precise results; more integrated results
  • #6: So far, these requirements have proven to be a large burden. Given that the amount of linked data is large and continuously evolving, it is inherently difficult to know what is in there (i.e., the data and the schema) and to formulate the corresponding structured queries for addressing some given information needs. Hence, it is desirable to have a mechanism that allows users to express information needs in their own words. Another aspect of dealing with the large Web of linked data is scalability. Processing the needs against the entire Web might be too time consuming and not needed, especially when users are interested in and want to choose some particular sources of information. Processing against a relevant subset of linked data identified by the user is more scalable and possibly the only practical solution for the large Web of linked data.
  • #7: (Rank combination of sources) (Automatically process relevant combination of sources) Concerning these problems, the question we deal with is: given the needs expressed by users as sets of keywords, are there corresponding answers in linked data, and what combination of data sources shall be used to produce them? Further, the aim is not to directly compute results but to quickly identify and let users and system focus on the combination of sources that produce non-empty results. They recognized the fact that the computational complexity resulting from a large-scale setting can be partially addressed when allowing users to choose and retrieve answers from only some particular databases. Given a set of keywords, the goal is to find and rank the single most relevant databases that contain the answers. Following this line, we propose specific solutions for the linked data context. The differ-
  • #9: Linked data can be conceived as a set of data graphs, each representing a particular source. As a working definition, we present a simple graph-based model of linked data called the Web graph. In that model, we distinguish between the Web data graph, representing relationships between individual data elements; the Web schema graph, which captures information about groups of elements; and the Web source graph, which contains information at the level of data sources. This is a simple model of linked data that omits details not necessary for this work. In particular, data elements may correspond to RDF resources, blank nodes or literals. Schema elements might stand for classes or data types. For keyword query routing, these distinctions are not relevant; what matters is that the elements can be recognized via their labels. While different kinds of links can be established, the ones frequently found are sameAs links, which denote that two RDF resources or two classes are the same. There is also no need to distinguish the types of links. Only the fact that sources can be reached via some kind of link m ∈ M matters.
  • #13: A valid plan in our example is RP = {Freebase, DBLP, DBPedia}. Note that validity does not imply relevance. That is, a valid plan ensures that results can be produced, but for the users, these results may differ in relevance. A proper account of relevance and the ranking of routing plans based on the relevance of their results go beyond the scope of this paper, which is focused on efficiency aspects of computing valid plans. We assume a fixed ranking function, which equally applies to all summaries discussed in this paper. We refer the interested readers to our report [8], which discusses relevance and the ranking function.
  • #16:
      - Keywords map against elements of the entire data web
      - Routing simply based on coverage
      - Consider further factors for data source identification, i.e., characteristics of the data, the data sources and the links between them
      - Keyword query routing: keyword routing in a truly distributed setting, such that several data sources might be used to answer a set of keywords. Only the highly relevant data sources are selected to answer the user query
  • #17: Elements stands for all the keywords that are mentioned in elements of the graphs $G$. Every $n^{KS}_k \in N^{KS}_K$ is in fact a tuple $(k, G_k)$ that represents a keyword $k$ and the graphs $G_k \subseteq G$ mentioning $k$.
  • #19: As opposed to E-KERG, this one is indeed a summary model because it clusters two element-level relationships $(\langle k_i, n^K_i(n_i, g_i, K_i)\rangle, \langle k_j, n^K_j(n_j, g_j, K_j)\rangle)$ and $(\langle k_v, n^K_v(n_v, g_v, K_v)\rangle, \langle k_w, n^K_w(n_w, g_w, K_w)\rangle)$ to one schema-level relationship when they capture the same keyword relationships (i.e., $k_i = k_v$ and $k_j = k_w$) between the same classes (i.e., $n'_i = n'_v$ and $n'_j =$
  • #21: Intuitively speaking, this procedure simply retrieves sources that cover the keywords, and in order to cover all |K| query keywords, it uses |K|-combinations of these sources as routing plans.
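The procedure described in this note can be sketched as follows. It is only an illustration of the coverage-based idea: the keyword-to-source map, the function name `ks_routing_plans` and the ordering of the returned plans are assumptions, and no validity check or ranking function from the paper is applied.

```python
from itertools import product


def ks_routing_plans(keyword_sources, query):
    """Pick one covering source per keyword and merge each choice into a plan."""
    per_keyword = [sorted(keyword_sources.get(k, ())) for k in query]
    # One source for each of the |K| keywords; duplicates collapse in the set.
    plans = {frozenset(combo) for combo in product(*per_keyword)}
    # Assumption: order plans by size, then alphabetically, for stable output.
    return sorted(plans, key=lambda plan: (len(plan), sorted(plan)))


# Hypothetical keyword-to-source map for the running example.
keyword_sources = {
    "stanford": {"Freebase"},
    "award": {"DBPedia"},
    "article": {"DBLP", "Freebase"},
}

plans = ks_routing_plans(keyword_sources, ["stanford", "award", "article"])
```

Because the plans rely on coverage alone, nothing guarantees that the chosen sources are connected in a way that yields answers, which matches the low P@5 reported for KS.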
  • #23: - Summary allows for routing plan computation that is complete but not sound, different complexities (see paper)
  • #24: Values represent the average computed for all 30 queries. Using E-KERG, precision was up to 100 percent, i.e., for $d^{sum}_{max} = d^{data}_{max} = 4$. With P@5 being always above 0.6 when dmax > 1, S-KERG and D-KERG also achieved relatively good results. P@5 for KS was only 6%. Clearly, dmax had a positive effect. More valid plans were computed when a higher value was used for dmax. However, using dmax = 4 instead of 3 did not yield clear improvement. Fig. 4b shows the effect of query length |K|. Quite clearly, queries with a larger number of keywords resulted in lower precision. It dropped as low as 0.23 when using D-KERG for queries with 5 keywords. KS is the model that produces only very few valid plans. This result was improved by one order of magnitude when relationships between keywords were used. The more fine-grained a model captures the relationships, the larger was the percentage of valid plans. Even a summary at the level of sources produced reasonably high quality results, i.e., every second plan was a valid one.
  • #25: Performance is measured as the average response time for computing routing plans. Fig. 5a shows the performance for queries at various settings using different values for dmax. This parameter had no effect on the KS's results but clearly influenced the performance achieved with KERG summaries. Times increased with higher values for dmax. While this increase was sharp for E-KERG and S-KERG, time performance of D-KERG was relatively stable. In particular, time required by D-KERG was no more than 10ms on average. While the times shown are the actual times obtained for the other models, only the lower bound was shown for E-KERG. This is because we applied a timeout of 6min. Fig. 5c shows the exact times obtained for E-KERG and the queries that had to be aborted due to timeout. For dmax = 4, for instance, 1 out of every three queries was aborted. Expectedly, more time was needed when the number of query keywords increases, as illustrated in Fig. 5b. It seems that all the other models had poor performance w.r.t. complex queries but D-KERG.
  • #29: Process data as they come instead of blocking / waiting
  • #39: In terms of total execution time, MI and BU are comparable, while TD is significantly faster in most cases. While TD incurs more overhead for the initial source selection because of the larger index, it enables the exclusion of sources. Due to the high network cost, not retrieving irrelevant sources results in a significant performance gain. Using only a partial index, MI is not able to restrict the number of sources that have to be retrieved. Even after the final result was reported, other relevant sources had to be processed, but did not contribute to the final result. This indicates that early result reporting resulting in better responsiveness is very important in some cases, where processing all sources might be very costly and not needed. Clearly, TD produced results earlier than MI, which was better than BU.
  • #41: We presented a solution to the novel problem of keyword query routing. It helps users without knowledge of the evolving linked data and schema to find combinations of sources that contain answers corresponding to their needs. This solution also partially addresses the aspect of efficiency, as queries can then be evaluated against the relevant sources identified by the user, instead of using the entire Web of linked data. We have proposed a family of summary models. Through theoretical and experimental analysis, we showed that it is important to capture keyword relationships. Compared to the KS model representing the naive baseline that stores only single keywords, the KERG models relying on relationships could produce a much