Harnessing Relationships for Domain-specific Subgraph
Extraction: A Recommendation Use Case
IEEE International Conference on Big Data
Washington D.C., USA, 5th – 8th December, 2016
Sarasi Lalithsena (sarasi@knoesis.org), Kno.e.sis Research Center, Wright State University
Pavan Kapanipathi (kapanipa@us.ibm.com), Thomas J. Watson Research Center, IBM Research
Amit Sheth (amit@knoesis.org), Kno.e.sis Research Center, Wright State University
Knowledge Graphs on the Web
• Represent the data in structured format using a graph-based data
model
• Linked Open Data: > 1000 datasets
– DBpedia: 6.2M entities and 1B facts
– Wikidata: 8M entities and 70M facts
• Google Knowledge Graph: 570M entities and 18B facts
• Schema.org annotation
Knowledge Graph In Action
IBM Watson uses YAGO Hierarchy to
extract the types
Movie recommendation algorithms use
DBpedia and Linked MovieDB to
determine how two movies are
semantically relevant
Motivation
• Utilizing large cross-domain KGs can be computationally intensive
• Existing approaches extract relevant subgraphs by navigating a
predefined number of hops (2-4) from known domain entities
• Example: a movie recommendation system extracts the subgraph by
navigating 3 hops from 3,072 movies in DBpedia; the resulting subgraph
encompasses 66% of the DBpedia entities
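The n-hop expansion used by existing approaches can be sketched as a breadth-first traversal from the seed domain entities. This is only a minimal illustration on an in-memory triple list; the function name `n_hop_subgraph`, the toy triples, and the choice to follow edges from subject to object only are assumptions of the sketch, not details taken from the talk.

```python
from collections import defaultdict

def n_hop_subgraph(triples, seeds, n):
    """Collect every triple reachable within n hops of the seed entities.

    triples: iterable of (subject, predicate, object) tuples.
    Edges are followed from subject to object only (assumption of this sketch).
    """
    outgoing = defaultdict(list)
    for s, p, o in triples:
        outgoing[s].append((p, o))

    visited = set(seeds)
    frontier = set(seeds)
    subgraph = set()
    for _ in range(n):
        next_frontier = set()
        for node in frontier:
            for p, o in outgoing[node]:
                subgraph.add((node, p, o))
                if o not in visited:
                    visited.add(o)
                    next_frontier.add(o)
        frontier = next_frontier
    return subgraph

# Hypothetical toy data, not taken from DBpedia.
toy_triples = [
    ("Transformers", "director", "Michael Bay"),
    ("Michael Bay", "knownFor", "Action Film"),
    ("Cursed", "director", "Wes Craven"),
    ("Wes Craven", "deathCity", "Los Angeles"),
    ("Los Angeles", "country", "USA"),
]
two_hop = n_hop_subgraph(toy_triples, {"Transformers", "Cursed"}, 2)
```

Every relationship met along the way is followed, which is why the extracted subgraph grows so quickly with the number of hops.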
Motivation
• Certain applications are domain-specific and do not require the
complete knowledge graphs
[Figure: example movie graph]
• Relevant for movie recommendation: Transformers (2007) and The Terminator are connected because their directors (Michael Bay, James Cameron) are both knownFor Action Film
• Not relevant for movie recommendation: Cursed and Random Hearts are connected because their directors (Wes Craven, Sydney Pollack) share the deathCity Los Angeles
Problem
How do we extract the domain-specific
subgraph from large cross-domain knowledge
graphs without compromising the accuracy?
Relationship is the key
[Figure: example movie graph. The relationships director and knownFor connect the relevant pair (Transformers (2007), The Terminator); director and deathCity connect the non-relevant pair (Cursed, Random Hearts)]
Domain Specificity Measures for Relationships
• Association of the relationship with domain entities provides
evidence for domain specificity
• Relationship director is specific to the movie domain
• Relationship country is not specific to the movie domain
• Association of the relationship with the domain entities is
straightforward for direct relationships such as director and country
• However, it is not trivial for indirect relationships such as award,
spouse, and capital
Domain Specificity Measures for Relationships
• To measure the domain-specificity of both direct and indirect
relationships, we identify two characteristics of a dataset:
– Entity Type
– Property Path
• We formalize these two characteristics to calculate domain
specificity of a relationship
Type-based Domain Specificity Measure
Measure uses the association between entity types
[Figure: movie m1 linked to entities of type Director and Country, with the spouse relationship; associations are labelled as strong or weak]
• Strength of association between the domain entity type and the other entity type
– e.g., association between the Movie type and the Director type
• Strength of association between the entity type and the relationship
– e.g., association between the award relationship and the Director type
Type-based Domain Specificity Measure
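• Strength of association between directly connected entity types
$$d\_typerel(t_i, t_j) = \frac{edge\_count_{t_i,t_j}}{edge\_count_{t_i} \cdot edge\_count_{t_j}}$$
• Strength of association between indirectly connected entity types
$$ind\_typerel(t_d, t_{n-1}, n) = \prod_{k=1}^{n-1} d\_typerel(t_{k-1}, t_k), \quad t_0 = t_d$$
• Strength of association between entity types and their direct relationships
$$prop\_rel(p, t) = \frac{edge\_count_{p,t}}{edge\_count_{p}}$$
• Domain-specificity score of relationship p at the nth hop
$$prop\_score(p, n) = \sum_{t_{n-1}^{j} \in C} ind\_typerel(t_d, t_{n-1}^{j}, n) \cdot prop\_rel(p, t_{n-1}^{j})$$
[Figure: path from the domain entity D through hops H1, H2, H3, ..., Hn-1 to Hn via relationships p1, p2, p3, ..., pn, pn+1, with example types Movie (D), Director (H1) and Award (H2); annotations mark the associations between Movie and Director, between Movie and Award, and between p3 and Award at the nth hop]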
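The type-based score can be sketched for the second hop, where the indirect association reduces to the direct association between the domain type and the first-hop type. The code below is an illustration only: the toy typed edges are hypothetical, `edge_count` for a type is assumed to mean the number of edges touching entities of that type, and `prop_score_hop2` is a name made up for this sketch.

```python
from collections import Counter

# Hypothetical typed edges: (subject_type, relationship, object_type).
typed_edges = [
    ("Movie", "director", "Director"),
    ("Movie", "director", "Director"),
    ("Movie", "country", "Country"),
    ("Director", "award", "Award"),
    ("Director", "spouse", "Person"),
    ("Actor", "spouse", "Person"),
    ("Country", "capital", "City"),
    ("Organization", "country", "Country"),
]

pair_count = Counter()      # edges between two entity types
type_count = Counter()      # edges touching a type (assumed definition)
rel_count = Counter()       # edges labelled with a relationship
rel_type_count = Counter()  # edges labelled p that leave type t

for s_type, rel, o_type in typed_edges:
    pair_count[frozenset((s_type, o_type))] += 1
    type_count[s_type] += 1
    type_count[o_type] += 1
    rel_count[rel] += 1
    rel_type_count[(rel, s_type)] += 1

def d_typerel(ti, tj):
    """Association between two directly connected entity types."""
    return pair_count[frozenset((ti, tj))] / (type_count[ti] * type_count[tj])

def prop_rel(p, t):
    """Share of the edges labelled p that are attached to type t."""
    return rel_type_count[(p, t)] / rel_count[p]

def prop_score_hop2(p, domain_type, candidate_types):
    """Second-hop score: sum over first-hop types of d_typerel * prop_rel."""
    return sum(d_typerel(domain_type, t) * prop_rel(p, t) for t in candidate_types)

first_hop_types = ["Director", "Country"]
scores = {p: prop_score_hop2(p, "Movie", first_hop_types)
          for p in ("award", "spouse", "capital")}
```

On this toy data award scores highest: it benefits from the Movie and Director association and, unlike spouse, occurs only with the Director type.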
Path-based Domain Specificity Measure
Measure uses the association between intermediate relationships
[Figure: graph expanding from movies m1 and m2 to first-hop nodes I1–I5; in the second panel only I1, I3 and I5 are expanded further, to nodes I6–I15]
• Uses an iterative approach by considering already identified
domain-specific paths
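The iterative idea can be sketched as a hop-by-hop expansion that keeps only the top-k relationships at each hop and follows only those to the next hop. In this sketch the ranking is a plain frequency count, which is a stand-in chosen for brevity and not the measure defined in the talk; `extract_dsg` and the toy triples are likewise hypothetical.

```python
from collections import Counter, defaultdict

def extract_dsg(triples, seeds, top_k_per_hop):
    """Expand hop by hop, following only the top-k relationships of each hop.

    top_k_per_hop: one k per hop, e.g. [15, 15] for a 2-hop subgraph.
    """
    outgoing = defaultdict(list)
    for s, p, o in triples:
        outgoing[s].append((p, o))

    frontier, visited, kept = set(seeds), set(seeds), set()
    for k in top_k_per_hop:
        candidates = [(s, p, o) for s in frontier for p, o in outgoing[s]]
        # Stand-in score: how often each relationship occurs at this hop.
        freq = Counter(p for _, p, _ in candidates)
        selected = {p for p, _ in freq.most_common(k)}
        next_frontier = set()
        for s, p, o in candidates:
            if p in selected:
                kept.add((s, p, o))
                if o not in visited:
                    visited.add(o)
                    next_frontier.add(o)
        frontier = next_frontier
    return kept

# Hypothetical toy data.
toy = [
    ("m1", "director", "d1"), ("m2", "director", "d2"),
    ("m1", "country", "US"), ("US", "capital", "DC"),
    ("d1", "award", "a1"), ("d2", "award", "a2"),
    ("d1", "birthPlace", "c1"),
]
dsg = extract_dsg(toy, {"m1", "m2"}, [1, 1])
```

Because country is not selected at the first hop, the node US is never expanded and capital never becomes a candidate at the second hop.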
Path-based Domain Specificity Measure
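• Domain specificity of the nth-hop relationship depends on the domain-specific paths of length n−1
$$PMI(p, dsp_{n-1}) = \log \frac{Prob(p, dsp_{n-1})}{Prob(p) \cdot Prob(dsp_{n-1})}$$
$$Prob(p, dsp_{n-1}) = \frac{Path(dsp_{n-1}, p)}{\sum_{p' \in P} Path(dsp_{n-1}, p')}$$
• To address PMI's sensitivity to low-frequency values, the normalized PMI is used
$$NPMI(p, dsp_{n-1}) = \frac{\log \frac{Prob(p, dsp_{n-1})}{Prob(p) \cdot Prob(dsp_{n-1})}}{-\log Prob(p, dsp_{n-1})}$$
[Figure: path from the domain entity D through hops H1, H2, H3, ..., Hn-1 to Hn via relationships p1, p2, p3, ..., pn, pn+1; p1 – p2 is a domain-specific path, and the measure captures the association between p3 and the domain-specific path p1 – p2]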
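A small numeric sketch of the normalized PMI may help. The slides define only the joint term, so estimating all three probabilities as relative frequencies over one toy count table is an assumption of this sketch; the counts and the helper names `npmi` and `score` are hypothetical.

```python
import math
from collections import Counter

def npmi(p_joint, p_rel, p_path):
    """Normalized PMI in [-1, 1]; assumes 0 < p_joint <= 1."""
    if p_joint == 1.0:
        return 1.0  # relationship and path always occur together
    pmi = math.log(p_joint / (p_rel * p_path))
    return pmi / -math.log(p_joint)

# Hypothetical counts of (preceding path, nth-hop relationship) pairs.
path_counts = Counter({
    ("director", "award"): 40,
    ("director", "deathCity"): 10,
    ("country", "deathCity"): 50,
})
total = sum(path_counts.values())

def score(path, rel):
    """NPMI between a preceding path and the relationship that follows it."""
    p_joint = path_counts[(path, rel)] / total
    p_rel = sum(c for (_, r), c in path_counts.items() if r == rel) / total
    p_path = sum(c for (q, _), c in path_counts.items() if q == path) / total
    return npmi(p_joint, p_rel, p_path)

award_score = score("director", "award")      # positive on this toy data
death_score = score("director", "deathCity")  # negative on this toy data
```

A positive value means the relationship follows the path more often than chance would suggest, a negative value means less often, and the normalization keeps rare pairs from receiving inflated scores.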
Evaluation: Recommendation Use Case
• Evaluate the effectiveness of the domain-specific subgraph using a
recommendation use case
• Implement an existing recommendation algorithm and run it on the
n-hop expansion subgraph (baseline) and on the domain-specific
subgraph (our approach)
• Use two domains, Movie and Book, with the existing datasets MovieLens
and DBbook
• MovieLens consists of 1,000,209 ratings for 3,883 movies by 6,040
users; DBbook consists of 72,372 ratings for 8,170 books by 6,181 users
Evaluation Metrics
• Graph reduction
– Measure the reduction of the graph with nodes, relationships and reachable
paths
• Impact on accuracy
– Precision@n
– Rating Deviation
• Impact on run time
Evaluation Metrics – Graph Reduction
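Movie domain: 2-hop graphs (percentages are reductions relative to the 2-hop graph)
| Graph | Path-based: Relations | Path-based: Nodes | Path-based: Paths | Type-based: Relations | Type-based: Nodes | Type-based: Paths |
|---|---|---|---|---|---|---|
| 2-hop | 349 | 1.07M | 108.4M | 349 | 1.07M | 108.4M |
| DSG2(15,15) | 15 (95.7%) | 0.08M (92.0%) | 5.08M (95.3%) | 14 (95.9%) | 0.13M (87.6%) | 17M (83.9%) |
| DSG2(25,25) | 25 (92.8%) | 0.13M (87.3%) | 17.4M (83.8%) | 24 (93.1%) | 0.63M (40.9%) | 61.6M (43.19%) |
| DSG2(35,35) | 35 (90%) | 0.64M (40.7%) | 61.64M (43.1%) | 32 (90.8%) | 0.64M (40.7%) | 61.62M (43.18%) |

Book domain: 2-hop graphs
| Graph | Path-based: Relations | Path-based: Nodes | Path-based: Paths | Type-based: Relations | Type-based: Nodes | Type-based: Paths |
|---|---|---|---|---|---|---|
| 2-hop | 424 | 1.2M | 793.4M | 424 | 1.2M | 793.4M |
| DSG2(15,15) | 15 (96.5%) | 0.09M (92.8%) | 159.6M (79.6%) | 15 (96.5%) | 0.09M (92.8%) | 159.7M (80%) |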
Evaluation Metrics – Graph Reduction
3-hop graphs
| Graph | Path-based: Relations | Path-based: Nodes | Path-based: Paths | Type-based: Relations | Type-based: Nodes | Type-based: Paths |
|---|---|---|---|---|---|---|
| Movie 3-hop | 636 | 2.86M | 4885.3M | 636 | 2.86M | 4885.3M |
| Movie DSG3(15,25,15) | 30 (95.3%) | 0.19M (93.2%) | 48.4M (98.9%) | 24 (96.2%) | 0.26M (90.9%) | 105.5M (97.83%) |
| Book 3-hop | 641 | 3.2M | 13582.8M | 641 | 3.2M | 13582.8M |
| Book DSG3(15,25,15) | 31 (95.2%) | 0.18M (94.2%) | 1082.6M (92.2%) | 21 (96.7%) | 0.12M (96%) | 1062.5M (92.33%) |

On average, the domain-specific subgraph is 80% to 90% smaller than
the n-hop expansion subgraph
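The percentages in the graph reduction tables follow from the raw counts. A small sketch of the arithmetic, where the helper name `reduction_pct` is hypothetical:

```python
def reduction_pct(full_size, reduced_size):
    """Percentage by which the reduced graph is smaller than the full graph."""
    return round(100 * (1 - reduced_size / full_size), 1)

# Relations in the movie 2-hop graph vs. the path-based DSG2(15,15).
movie_relations = reduction_pct(349, 15)   # 95.7
# Relations in the book 2-hop graph vs. DSG2(15,15).
book_relations = reduction_pct(424, 15)    # 96.5
```

Node and path percentages computed this way from the rounded figures in the tables can differ slightly from the reported ones, since the tables show rounded counts.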
Evaluation Metrics – Precision@n
$$precision@n = \frac{\text{number of relevant items in the top } n}{n}$$
[Figure: precision@n for the Movie 2-hop graphs]
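Precision@n can be computed as in the following minimal sketch; the function name and the toy recommendation list are hypothetical.

```python
def precision_at_n(recommended, relevant, n):
    """Fraction of the top-n recommended items that are relevant."""
    top_n = recommended[:n]
    return sum(1 for item in top_n if item in relevant) / n

ranked = ["m1", "m4", "m2", "m5"]   # hypothetical ranked recommendations
relevant = {"m1", "m2", "m3"}       # hypothetical relevant items for the user
p_at_2 = precision_at_n(ranked, relevant, 2)   # 0.5
p_at_4 = precision_at_n(ranked, relevant, 4)   # 0.5
```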
Evaluation Metrics – Precision@n
[Figure: precision@n for the Movie 3-hop graph]
Evaluation Metrics – Precision@n
[Figure: precision@n for the Book domain]
Evaluation Metrics – Rating Dev
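• Rating Dev
$$ratingdev(u) = \frac{\sum_{r \in R} \left( itemrating(r) - avgrating(u) \right)}{|R|}$$
Example ratings:
| Movie | Rating |
|---|---|
| m1 | 5 |
| m2 | 3 |
| m3 | 3 |
| m4 | 1 |
| m5 | 1 |
Legend: Relevant Movies, Irrelevant Movies
Recommended from the N-hop subgraph: m1, m2
Recommended from the DSG: m2, m3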
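The rating deviation for the example ratings can be worked through as below. The sketch assumes that avgrating(u) is the mean of all ratings given by the user and that R is the set of recommended items the user has rated; both readings, and the function name, are assumptions.

```python
def rating_dev(recommended, ratings, avg_rating):
    """Mean difference between the ratings of recommended items and the user's average."""
    return sum(ratings[r] - avg_rating for r in recommended) / len(recommended)

ratings = {"m1": 5, "m2": 3, "m3": 3, "m4": 1, "m5": 1}
avg = sum(ratings.values()) / len(ratings)           # 2.6

n_hop_dev = rating_dev(["m1", "m2"], ratings, avg)   # about 1.4
dsg_dev = rating_dev(["m2", "m3"], ratings, avg)     # about 0.4
```

Both recommendation lists contain only movies rated above the user's average, but the lower deviation of the second list reveals that a highly rated item (m1) was replaced by a lower-rated one (m3), which precision@n alone would not show.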
Evaluation Metrics – AvgDev
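Movie domain
| Top n | 2-hop Baseline | 2-hop DSG2(15,15) | 3-hop Baseline | 3-hop DSG3(15,25,15) |
|---|---|---|---|---|
| 5 | 0.8222 | 0.823 | 0.807 | 0.823 |
| 10 | 0.814 | 0.816 | 0.806 | 0.815 |
| 15 | 0.810 | 0.811 | 0.806 | 0.811 |
| 20 | 0.806 | 0.807 | 0.805 | 0.806 |

Book domain
| Top n | 2-hop Baseline | 2-hop DSG2(15,15) | 3-hop Baseline | 3-hop DSG3(15,25,15) |
|---|---|---|---|---|
| 1 | 0.592 | 0.584 | 0.533 | 0.558 |
| 2 | 0.599 | 0.604 | 0.571 | 0.579 |
| 3 | 0.601 | 0.614 | 0.569 | 0.579 |
| 4 | 0.606 | 0.617 | 0.595 | 0.595 |
| 5 | 0.610 | 0.620 | 0.596 | 0.6 |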
Evaluation Metrics – Run Time Performance
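| Graph | Movie: n-hop expansion | Movie: DSG (Path) | Movie: DSG (Type) | Book: n-hop expansion | Book: DSG (Path) | Book: DSG (Type) |
|---|---|---|---|---|---|---|
| 2-hop | 72s | 5s | 11.2s | 10.15m | 1.3m | 1.4m |
| 3-hop | 2 h 35 m | 76s | 3.2 m | 7 h 40 m | 15.2m | 27m |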
Conclusion
• Propose an approach to extract a domain-specific subgraph from a
large, cross-domain KG
• Treat non-taxonomical relationships as first-class objects
• The approach reduced the graph size by more than 80%, which
led to a tenfold decrease in the computation time of the
recommendation algorithm
• The accuracy of the algorithm is not compromised; rather, more
accurate results were found
Thank You!
http://knoesis.wright.edu/people/sarasi
sarasi@knoesis.org
Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing
Wright State University, Dayton, Ohio, USA
Harnessing Relationships for Domain-specific Subgraph
Extraction: A Recommendation Use Case
Presented at the IEEE International Conference on Big Data (IEEE BigData 2016)
Washington D.C., USA, 5th – 8th December, 2016
Amit Sheth
amit@knoesis.org
Pavan Kapanipathi
kapanipa@us.ibm.com
Sarasi Lalithsena
sarasi@knoesis.org


Editor's Notes

  • #2: Hi everyone, welcome to my talk. I am Sarasi Lalithsena, a PhD candidate at the Kno.e.sis research center, affiliated with Wright State University in Dayton, OH. I am going to present our work on harnessing relationships for domain-specific subgraph extraction. This is joint work with Pavan Kapanipathi from IBM Research and my advisor, Amit Sheth.
  • #3: Recent advances by the AI and Semantic Web (SW) communities have helped us generate human- and machine-readable structured data on the Web, which we refer to as the Web of data. One example is Linked Open Data, a collection of interlinked structured datasets on the Web; it reported more than 1000 datasets in 2014. Two of the prominent datasets on LOD are DBpedia and Wikidata. DBpedia has 6.2M entities and 1B facts automatically extracted from Wikipedia infoboxes. Wikidata is a collaboratively edited knowledge base, known as the Wikipedia for data; it contains 8M entities and 70M facts. Then there is the Google Knowledge Graph, which Google uses to power its entity search. There are also schema.org annotations in web pages contributing to the Web of data. These are some nice efforts to create the data.
  • #4: These knowledge graphs are being used in a number of applications. A well-known example is that Google uses the Google KG to enhance its search. Other than that, IBM Watson, the question answering system that competed in the Jeopardy! quiz show, uses the YAGO hierarchy to extract the types of the candidate answers as one way to rank them. Movie recommendation systems determine how two movies are semantically relevant. For example, if I have liked watching movies directed by Steven Spielberg based on humanistic issues in the past, there is a higher probability of me liking his new movie Bridge of Spies. Tasks such as recommendation use KGs to improve item-based similarity.
  • #5: However, utilizing large cross-domain KGs can get computationally intensive. Existing approaches leveraging these graph-based datasets extract the relevant subgraphs by navigating a predefined number of hops (2-4) from known domain entities. For example, a movie recommendation system extracts the subgraph by navigating 3 hops using 3072 movies in DBpedia. The extracted subgraph encompasses 66% of the DBpedia entities as it reaches hop 3. This is supported by the fact that the mean shortest path between entities in DBpedia is up to around 5 hops, so navigating 4 hops can cover a large part of the dataset.
  • #6: Beyond size, the irrelevant portion can connect entities that are not necessarily relevant to the domain, which can affect accuracy. Look at this example: the movies Transformers and The Terminator are connected because the directors of these movies are known for action films. However, the movies Cursed and Random Hearts are also connected, using the same number of hops, because their directors died in the same place. While the first pair can be relevant to the domain, the second pair is not.
  • #8: In determining which movie pairs are relevant to the domain, the relationship can be key. In the example above, identifying that director and knownFor are important to the domain and that deathCity is less relevant helps us determine which movie pairs are more relevant. Hence, our approach tries to capture the domain specificity of each relationship.
  • #9: We use the association of the relationship with the domain entities as evidence for domain specificity. Director is specific to the movie domain, as it appears mainly with movie entities and less with other types of entities, whereas country also appears with other types such as person and organization. Association of a direct relationship with the domain entities is straightforward. However, it is not trivial for indirect relationships such as award, spouse, and capital, as they are not directly connected to the movie entities.
  • #10: We will present these two formalizations in the following slides, followed by the evaluation.
  • #11: The first measure is the type-based domain specificity measure, which uses the entity types. The intuition is that if the association between the domain entity type and another entity type is strong, it is highly likely that the relationships adjacent to the other entity type are also domain-specific. For example, as the types Movie and Director have a strong association, award, which is directly connected to Director, has a higher likelihood of being relevant to the domain than capital, which is a direct relationship of the type Country. But this is not enough, because by this reasoning spouse could also be domain-specific. To capture that, we use the strength of association between the relationship and the adjacent type. In this case, as award appears with the Director type while spouse can occur with any person type, award has a higher likelihood of being domain-specific.
  • #12: We used these intuitions to come up with the type-based measure. The first equation measures the strength of association between directly connected entity types using the edge counts of the two types. We then use that to measure the strength of association between indirectly connected entity types by propagating direct associations. The third equation measures the strength of association between entity types and their direct relationships. Finally, we combine all of these to compute the domain-specificity score for a given relationship.
  • #13: While we use the strength of association between the intermediate types in the type-based measure, we use the association between intermediate relationships in this measure. We use an iterative approach for this. Once we identify the domain-specific relationships in the first-hop navigation, we use only those relationships to reach the next hop, as shown in the example.
  • #14: Here the intuition is that the domain specificity of the nth-hop relationship depends on the domain-specific paths of length n-1. We adopt the well-known association measure PMI to capture the strength of association between the relationship p and the domain-specific paths of length up to n-1. As PMI is sensitive to low-frequency values, we use the normalized PMI value.
  • #15: Now, to evaluate this, we use a recommendation use case. We implement an existing recommendation algorithm and use the n-hop expansion subgraph and the domain-specific subgraph extracted by our approach to evaluate its effectiveness. In generating the domain-specific subgraph, we compute the domain-specificity measures and use only the top-n relationships. We did our evaluation on two domains, Movie and Book, with well-known datasets for recommendation.
  • #16: There were a couple of aspects we wanted to evaluate. First, how much we were able to reduce the graph: we count the nodes, relationships, and reachable paths in both the n-hop expansion subgraph and the domain-specific subgraph. Then we want to see whether the graph reduction costs anything in terms of accuracy. Finally, we also measure the run time after reducing the graph.
  • #17: Here are the results for graph reduction for the subgraphs traversing up to 2 hops for the movie and book domains with both path-based and type-based techniques. In the movie domain, we show the reduction using different top-N values (15, 25, 35) to extract the subgraph. In both cases, we see an average 80% reduction of the graphs in nodes, relationships, and reachable paths.
  • #18: This shows the results for the 3-hop subgraphs for both movie and book.
  • #19: Here are the results for precision on the 2-hop subgraphs for the movie domain with both path-based and type-based measures. We take different n for precision, which is the x-axis. As you can see, for both path-based and type-based measures, the domain-specific subgraph selected with the top 15 relationships shows some improvement in precision. As we increase the number of relationships, precision goes down and settles at the baseline performance. This shows that we were able to rank the most relevant relationships higher.
  • #20: These are the movie 3-hop graphs for both measures. They show performance similar to the 2-hop graphs.
  • #21: Here are the results for the book domain. In the book domain, the path-based measure performs on par with the baseline, while the type-based measure slightly underperforms.
  • #22: Precision@n does not tell us whether the reduced graph replaces any highly relevant items with less relevant items. To measure that, we use the ratings given in the gold standard dataset to see whether we replace any highly rated items with low-rated items.
  • #23: Here are the results for average deviation, which show that we do not replace highly rated movies with low-rated movies. Except for the top-1 results in the book domain, in all other scenarios we have a higher or similar average deviation, which tells us that we did not replace any highly rated items.
  • #24: This is the run time performance for the 2-hop and 3-hop graphs. You can see the significant increase in time for the 3-hop expansion and our ability to reduce the time tenfold, from hours to seconds or minutes. VM core-8, with 15G RAM and
  • #25: To conclude, we propose an approach to extract the domain-specific subgraph from a large cross-domain KG by treating non-taxonomic relationships as first-class objects. We were able to reduce the graph size by 80%, with a tenfold decrease in computation time. The accuracy of the algorithm shows no compromise; rather, we found more accurate results.
  • #27: Run time performance with preprocessing to create the subgraph