What makes a linked data pattern interesting?
Szymon Klarman
Department of Computer Science
Brunel University London
June 7, 2016
Connected Data London
#ConnectedData2016
Linked Data
• data/knowledge represented in W3C standards OWL/RDF(S)
• flexible, unrestrictive, extendible
• machine (and human) accessible
• connected into a global Web of Data
• (open) and reusable (and when combined great things might happen!)
• perfectly functional also in closed environments
RDF(S) = graph structure + logical inference
Example data:
( b has participant A p )
( b type Regulation )
( p type Protein )
Literal annotations shown in the example: label „GRB2 regulates GAB1”, has entity id UniProt:P34723
Schema:
( Regulation subclass of Molecular interaction )
( Molecular interaction subclass of Biological event )
( has participant A subproperty of has participant )
( has participant range Chemical ), plus a domain declaration for has participant
has participant B appears in the schema next to has participant A
Inferred:
( b has participant p )
( b type Molecular interaction )
( b type Biological event )
( p type Chemical )
Querying:
( ?z has participant ?y )
( ?z type Biological event )
( ?y type Chemical )
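The inference step on these slides can be sketched in a few lines of Python. This is only an illustrative forward-chaining loop over tuples, not the tooling used in the talk. The predicate names are shorthand, and the domain of has participant is an assumption (the slide shows a domain edge, but its target is not recoverable here).

```python
# Toy RDFS-style forward chaining over (subject, predicate, object) tuples.
TYPE, SUBCLASS, SUBPROP, DOMAIN, RANGE = (
    "type", "subclass of", "subproperty of", "domain", "range")

schema = {
    ("Regulation", SUBCLASS, "Molecular interaction"),
    ("Molecular interaction", SUBCLASS, "Biological event"),
    ("has participant A", SUBPROP, "has participant"),
    ("has participant", DOMAIN, "Biological event"),  # assumed domain target
    ("has participant", RANGE, "Chemical"),
}
data = {
    ("b", "has participant A", "p"),
    ("b", TYPE, "Regulation"),
    ("p", TYPE, "Protein"),
}


def rdfs_closure(triples):
    """Apply subclass, subproperty, domain and range rules until nothing new appears."""
    closed = set(triples)
    while True:
        new = set()
        for s, p, o in closed:
            for s2, p2, o2 in closed:
                if p2 == SUBPROP and s2 == p:
                    new.add((s, o2, o))        # inherit the super-property
                if p2 == DOMAIN and s2 == p:
                    new.add((s, TYPE, o2))     # subject gets the domain class
                if p2 == RANGE and s2 == p:
                    new.add((o, TYPE, o2))     # object gets the range class
                if p == TYPE and p2 == SUBCLASS and s2 == o:
                    new.add((s, TYPE, o2))     # instance moves up the hierarchy
        if new <= closed:
            return closed
        closed |= new


inferred = rdfs_closure(data | schema) - data - schema
for triple in sorted(inferred):
    print(triple)
```

With these inputs the inferred triples are the ones listed under "Inferred" on the slide: b has participant p, b is a Molecular interaction and a Biological event, and p is a Chemical.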
Linked data mining
Emerging field: Workshop on Knowledge Discovery and Data Mining Meets Linked Open Data since 2012 (+ Linked Data Mining Challenge).
Problems:
• finding novel/surprising/interesting linked data patterns
• identifying relevant semantic connections
• predicting facts/links in knowledge graphs
Most modest yet fundamental task:
What’s in that linked data set?
• Web of Data will soon contain a lot of significant answers (42!)...
• ...so we need to know how to ask the right question...
• ...so we need to understand what’s in these data sets.
Examples are from the Big Mechanism project (http://52.26.26.74/).
So what’s in that linked data set?
Too much, too noisy...
No structure...
Ontologies on the Web of Data
Concept & property hierarchies + type assertions make up most of the Web of Data.
B. Glimm, A. Hogan, M. Krötzsch, A. Polleres: „OWL: Yet to arrive on the Web of Data?”, 2012
Typical ontologies don’t reflect the actual
graph structure of data...
The actual „conceptual data model”
[Diagram. Classes: Biological event, Molecular interaction, Chemical / Event, Statement, Article, Journal, Submitter. Properties: type, has participant, represents / is represented by, is extracted from, published in, has submitter.]
Linked data pattern
Example instance:
( GRB2_regulates_GAB1 type Regulation )
( GRB2_regulates_GAB1 type Biological event )
( GRB2_regulates_GAB1 has participant A GRB2_MOUSE )
( GRB2_regulates_GAB1 has participant B GAB1_MOUSE )
( GRB2_MOUSE type Protein )
( GAB1_MOUSE type Protein )
( statement_1 type Statement )
( statement_1 represents GRB2_regulates_GAB1 )
( GRB2_regulates_GAB1 is represented by statement_1 )
( statement_1 has submitter NaCTeM )
( NaCTeM type Submitter )
( statement_1 extracted from PMC123456 )
( PMC123456 type Article )
The same pattern with variables in place of individuals:
( ?z type Regulation )
( ?z type Biological event )
( ?z has participant A ?x )
( ?z has participant B ?y )
( ?x type Protein )
( ?y type Protein )
( ?u type Statement )
( ?u represents ?z )
( ?z is represented by ?u )
( ?u has submitter ?v )
( ?v type Submitter )
( ?u extracted from ?w )
( ?w type Article )
Linked data pattern ≈ conjunctive query / graph query
Query is a set of triples of the form:
( ?x type Concept )
( ?x Property ?y )
Linked data mining ≈ search through the query space
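Reading a pattern as a conjunctive query means that its matches are the variable bindings under which every triple of the pattern occurs in the data. A minimal sketch of such a matcher follows. The function names `is_var` and `match` are hypothetical, and the data is a fragment of the GRB2/GAB1 example; a real system would issue a SPARQL query against a triple store instead.

```python
def is_var(term):
    """Variables are written with a leading question mark, as on the slides."""
    return term.startswith("?")


def match(query, triples, binding=None):
    """Yield every variable binding under which all query atoms occur in triples."""
    binding = binding or {}
    if not query:
        yield dict(binding)
        return
    (s, p, o), rest = query[0], query[1:]
    for ts, tp, to in triples:
        if tp != p:
            continue
        candidate = dict(binding)
        ok = True
        for q, t in ((s, ts), (o, to)):
            if is_var(q):
                # Bind the variable, or check it agrees with an earlier binding.
                if candidate.setdefault(q, t) != t:
                    ok = False
            elif q != t:
                ok = False
        if ok:
            yield from match(rest, triples, candidate)


triples = {
    ("GRB2_regulates_GAB1", "type", "Regulation"),
    ("GRB2_regulates_GAB1", "has participant A", "GRB2_MOUSE"),
    ("GRB2_regulates_GAB1", "has participant B", "GAB1_MOUSE"),
    ("GRB2_MOUSE", "type", "Protein"),
    ("GAB1_MOUSE", "type", "Protein"),
    ("statement_1", "type", "Statement"),
    ("statement_1", "represents", "GRB2_regulates_GAB1"),
}
pattern = [
    ("?z", "type", "Regulation"),
    ("?z", "has participant A", "?x"),
    ("?x", "type", "Protein"),
    ("?u", "represents", "?z"),
]
answers = list(match(pattern, triples))
print(answers)
```

The number of answers returned by such a matcher is the raw count behind the frequency criterion discussed next.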
When is a linked data pattern interesting?
Two evaluation criteria:
• Frequency: the pattern has relatively many matches in the data set;
• Semantic content: the pattern carries a relatively large amount of information.
Frequency is the central criterion for the related problem of frequent
subgraph mining in the graph & multi-relational data setting.
⇢ linked data is graph data.
Semantic content criterion originates in logical/semantic theories of
information, and is used in inductive logic programming.
⇢ linked data is grounded in logic.
There is an inherent trade-off between the two criteria.
Frequency
The most frequent linked data patterns out there will always be:
X is something...
( ?x type owl:Thing )
Something is somehow related to something else...
( ?x owl:topObjectProperty ?y ) ( ?x type owl:Thing ) ( ?y type owl:Thing )
X is an event of type...?
Semantic content
regulation → molecular interaction → biological event → owl:Thing (each a subclass of the next)
The more possibilities you exclude the more you say.
Semantic content
The linked data pattern with the most
semantic content is the entire RDF graph...
Pattern Q1 has more semantic content than pattern Q2 (over ontology O) if Q1 (with O) logically entails Q2.
Q1: ( ?z type Regulation ) ( ?z has participant A ?y ) ( ?y type Protein )
Q2: ( ?z type Biological event ) ( ?z has participant ?y ) ( ?y type Chemical )
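The entailment between the two example patterns can be checked mechanically. The sketch below is a deliberate simplification, not a general query containment test: it assumes both patterns use the same variable names, and it only applies the subclass, subproperty and range axioms from the earlier RDF(S) slides. The dictionary and function names are hypothetical.

```python
# Ontology O from the RDF(S) slides, as single-parent lookups (a simplification).
SUBCLASS_OF = {
    "Regulation": "Molecular interaction",
    "Molecular interaction": "Biological event",
}
SUBPROPERTY_OF = {"has participant A": "has participant"}
RANGE = {"has participant": "Chemical"}


def ancestors(term, parent_of):
    """Return term followed by everything above it in the hierarchy."""
    chain = [term]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain


def saturate(query):
    """Add to a pattern every atom that follows from it under O."""
    atoms = set()
    for s, p, o in query:
        if p == "type":
            atoms.update((s, "type", c) for c in ancestors(o, SUBCLASS_OF))
        else:
            for prop in ancestors(p, SUBPROPERTY_OF):
                atoms.add((s, prop, o))
                if prop in RANGE:
                    atoms.update(
                        (o, "type", c) for c in ancestors(RANGE[prop], SUBCLASS_OF))
    return atoms


def entails(q1, q2):
    """True if every atom of q2 follows from q1 together with O."""
    return set(q2) <= saturate(q1)


q1 = [("?z", "type", "Regulation"),
      ("?z", "has participant A", "?y"),
      ("?y", "type", "Protein")]
q2 = [("?z", "type", "Biological event"),
      ("?z", "has participant", "?y"),
      ("?y", "type", "Chemical")]
print(entails(q1, q2), entails(q2, q1))
```

Q1 entails Q2 but not the other way round, so Q1 is the pattern with more semantic content.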
Trade-off
FREQ(Q) = #answers / #possible answers
CONT(Q) = 1 - Prob(Q is true a priori)
VALUE(Q) = weighted sum of FREQ(Q) and CONT(Q)
[Plot: Value, Freq and Cont; x-axis 0 to 900, y-axis 0 to 1.2]
Q1 = textual_entity(x)
Q2 = statement(x)
Q3 = event(x)
Q4 = journal_article(x), published_in(x, u), journal(u),
is_extracted_from(w, x), statement(w), contained_in(w, y),
table(y), represents(w, v), negative_regulation(v),
has_submitter(y, z), submitter(z), [...] (10 variables)
Q5 = table(x), has_submitter(x, z), submitter(z), contains_statement(x, y), statement(y), contained_in(y, x)
Q6 = positive_regulation(z), is_represented_by(z, y), statement(y), represents(y, z), contained_in(y, x),
table(x), has_submitter(x, v), submitter(v), contains_statement(x, y).
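The trade-off can be made concrete with a small scoring function. The slides only say that VALUE(Q) is a weighted sum of FREQ(Q) and CONT(Q). The convex combination with a single `weight` parameter, and passing the a priori probability in as a number, are assumptions made for this sketch.

```python
def freq(n_answers, n_possible_answers):
    """FREQ(Q): share of possible answers that actually match the pattern."""
    return n_answers / n_possible_answers


def cont(prior_probability):
    """CONT(Q): one minus the a priori probability that the pattern is true."""
    return 1.0 - prior_probability


def value(n_answers, n_possible_answers, prior_probability, weight=0.5):
    """VALUE(Q) as a convex combination; the weight is an assumed free parameter."""
    return (weight * freq(n_answers, n_possible_answers)
            + (1.0 - weight) * cont(prior_probability))


# A very general pattern: matches everything, but says nothing.
print(value(n_answers=100, n_possible_answers=100, prior_probability=1.0))
# A pattern of intermediate generality.
print(value(n_answers=50, n_possible_answers=100, prior_probability=0.25))
```

With equal weights the maximally general pattern scores 0.5, while the intermediate pattern scores 0.625: it loses some frequency but gains more in content.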
Algorithm
The space of all patterns over realistic linked data sets is virtually infinite.
But there are some good search heuristics:
• use precomputed „promising” building blocks;
• „climb up” over the most successful queries so far (but use a restart rule to avoid getting stuck locally).
[Plot: Value, Freq and Cont; x-axis 0 to 900, y-axis 0 to 1.2]
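The two heuristics can be illustrated with a generic hill-climbing loop with restarts. This is a sketch under stated assumptions, not the algorithm from the talk: patterns are modelled as sets of building blocks, a neighbour adds one block, each restart begins from a different single block, and the `score` function is a toy stand-in for VALUE(Q).

```python
import random


def hill_climb(blocks, score, restarts, seed=0):
    """Greedy search over sets of building blocks, restarted from several blocks."""
    rng = random.Random(seed)
    # Restart rule (assumed): begin each run from a different, randomly ordered block.
    starts = rng.sample(blocks, k=min(restarts, len(blocks)))
    best, best_score = frozenset(), float("-inf")
    for start in starts:
        current = frozenset([start])
        while True:
            neighbours = [current | {b} for b in blocks if b not in current]
            if not neighbours:
                break
            candidate = max(neighbours, key=score)
            if score(candidate) <= score(current):
                break  # local optimum reached for this run
            current = candidate
        if score(current) > best_score:
            best, best_score = current, score(current)
    return best, best_score


# Toy score: blocks "a" and "b" are useful, every other block is a penalty.
GOOD = {"a", "b"}


def toy_score(pattern):
    return len(pattern & GOOD) - len(pattern - GOOD)


best, best_score = hill_climb(["a", "b", "c", "d"], toy_score, restarts=4)
print(sorted(best), best_score)
```

Runs that start from "c" or "d" get stuck at a score of 1, because the penalised starting block is never dropped. The restarts from "a" or "b" reach the pattern {a, b} with score 2, which is why a restart rule is needed.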
What’s next...
The question „what’s in that linked data set?” is perhaps not the major one,
but the suggested notion of interestingness might well be:
• The „frequency vs. semantic content” trade-off reflects the dual (graphical and logical) nature of the RDF(S) representation model.
• Many linked data mining tasks can be described as: given Q2, find an interesting Q1 such that Q1 ⇢ Q2.
• Other, more abstract criteria might also be necessary.
Linked data mining requires novel principles and foundational approaches.
