Similarity Search in Information Networks using
Meta-Path Based between Objects
Abstract – Real-world physical and abstract data
objects are interconnected, forming enormous
networks. By structuring these data objects and
the interactions between them into multiple
types, such networks become semi-structured
heterogeneous information networks. The quality
analysis of large heterogeneous information
networks therefore poses new challenges. In the
current system, a generalized flow based method
is introduced for measuring relationships on
Wikipedia by reflecting three concepts:
distance, connectivity and co-citation. This
approach measures only relationships between
objects rather than similarities. To address
this problem we introduce a meta-path based
similarity search approach for heterogeneous
information networks. Under this framework,
similarity search and other mining tasks can
exploit the structure of the network.
Index terms – similarity search, information
network, meta-path, clustering.
I. INTRODUCTION
Similarity search, which aims at locating the
most relevant information for a query in large
collections of datasets, has been widely studied
in many applications. For example, in spatial
databases, people are interested in finding the k
nearest neighbors of a given spatial object.
Object similarity is also one of the most
primitive concepts for object clustering and
many other data mining functions.
In a similar context, it is critical to provide
effective similarity search functions in
information networks, to find similar entities for
a given entity. In a network of tagged images
such as Flickr, a user may be interested in
searching for the pictures most similar to a
given picture. In an e-commerce system, a user
would be interested in searching for the products
most similar to a given product. Unlike
attribute-based similarity search, links play an
important role in similarity search in
information networks, especially when full
information about the attributes of objects is
difficult to obtain.
There are a few studies leveraging link
information in networks for similarity search,
but most of these studies focus on
homogeneous or bipartite networks, as do
P-PageRank and SimRank. These similarity
measures disregard the subtlety of different
types among objects and links. Adopting such
measures in heterogeneous networks has
significant drawbacks: even if we just want to
compare objects of the same type, going through
link paths of different types leads to rather
INTERNATIONAL CONFERENCE ON CURRENT INNOVATIONS IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL ASSOCIATION OF ENGINEERING & TECHNOLOGY FOR SKILL DEVELOPMENT
ISBN: 378 - 26 - 138420 - 5
www.iaetsd.in
different semantic meanings, and it makes little
sense to mix them up and measure the similarity
without distinguishing their semantics.
To systematically distinguish the semantics
among paths connecting two objects, we
introduce a meta-path based similarity
framework for objects of the same type in a
heterogeneous network. A meta-path is a
sequence of relations between object types,
which defines a new composite relation between
its starting type and ending type. The meta-path
framework provides a powerful mechanism for a
user to select appropriate similarity semantics
by choosing a proper meta-path, or to learn it
from a set of training examples of similar
objects.
We present the meta-path based similarity
framework and relate it to two well-known
existing link-based similarity functions for
homogeneous information networks. We define a
novel similarity measure, PathSim, that is able
to find peer objects that are not only strongly
connected with each other but also share similar
visibility in the network. Moreover, we propose
an efficient algorithm to support online top-k
queries for such similarity search.
II. A META-PATH BASED SIMILARITY
MEASURE
The similarity between two objects in a
link-based similarity function is determined by
how the objects are connected in a network, which
can be described using paths. In a heterogeneous
information network, due to the heterogeneity of
the types of links, the ways to connect two
objects can be much more diverse. In a
bibliographic network, for example, different
connections represent different relationships
between authors, each having a different
semantic meaning.
Now the question is: given an arbitrary
heterogeneous information network, is there any
way to systematically identify all the possible
connection types between two object types? In
order to do so, we propose two important
concepts in the following.
a) Network Schema and Meta-Path
First, given a complex heterogeneous
information network, it is necessary to provide
its meta-level description for better
understanding of the network. Therefore, we
propose the concept of a network schema to
describe the meta structure of a network.
The concept of a network schema is similar to
that of the Entity-Relationship model in database
systems, but it captures only the entity types
and their binary relations, without considering
the attributes of each entity type. The network
schema serves as a template for a network, and
tells how many types of objects there are in the
network and where the possible links exist.
b) Bibliographic Schema and Meta-Path
In the bibliographic network schema, an arrow
explicitly shows the direction of a relation.
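As an illustration of the two concepts, a network schema can be held as a set of typed, directed relations, and a meta-path as a sequence of object types. The sketch below is only an example under assumptions: the type names (author, paper, venue, term) and the helper names are hypothetical and are not taken from the bibliographic schema figure of this paper.

```python
from typing import List, Set, Tuple

# Hypothetical bibliographic schema: each pair (source type, target type)
# is one directed relation allowed by the schema.
SCHEMA: Set[Tuple[str, str]] = {
    ("author", "paper"), ("paper", "author"),
    ("paper", "venue"), ("venue", "paper"),
    ("paper", "term"), ("term", "paper"),
}

def object_types(schema: Set[Tuple[str, str]]) -> Set[str]:
    """All object types mentioned by the schema."""
    return {t for relation in schema for t in relation}

def is_valid_meta_path(schema: Set[Tuple[str, str]], path: List[str]) -> bool:
    """A meta-path is valid if every consecutive type pair is a schema relation."""
    return len(path) >= 2 and all(
        (a, b) in schema for a, b in zip(path, path[1:])
    )

def is_round_trip(path: List[str]) -> bool:
    """A round-trip (symmetric) meta-path reads the same in both directions."""
    return path == path[::-1]

co_author = ["author", "paper", "author"]
same_venue = ["author", "paper", "venue", "paper", "author"]
print(is_valid_meta_path(SCHEMA, co_author), is_round_trip(same_venue))
```

Both example meta-paths connect authors to authors, but through different relations, which is why they carry different similarity semantics.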
III. META-PATH BASED SIMILARITY
FRAMEWORK
Given a user-specified meta-path, several
similarity measures can be defined for a pair of
objects according to the path instances
between them following the meta-path. Several
straightforward measures are listed in the
following.
Path count: the number of path instances
between objects x and y following meta-path P.
Random walk: s(x, y) is the probability of the
random walk that starts from x and ends at y
following meta-path P, which is the sum of the
probabilities of all the path instances.
Pairwise random walk: for a meta-path P that
can be decomposed into two shorter meta-paths
of the same length, s(x, y) is the pairwise
random walk probability of starting from objects
x and y and reaching the same middle object.
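The three measures can be sketched with matrix products for the round-trip meta-path author-paper-author. This is a minimal illustration, assuming a small hypothetical author-paper adjacency matrix W and using NumPy; the variable and function names are ours and do not come from the paper.

```python
import numpy as np

def row_normalize(A):
    """Turn each non-zero row into a probability distribution."""
    sums = A.sum(axis=1, keepdims=True)
    return np.divide(A, sums, out=np.zeros_like(A, dtype=float), where=sums != 0)

# Hypothetical author-paper adjacency matrix (3 authors x 3 papers).
W = np.array([[1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)

# Path count: number of author-paper-author path instances.
path_count = W @ W.T

# Random walk: walk author -> paper, then paper -> author.
U = row_normalize(W)      # transition probabilities author -> paper
V = row_normalize(W.T)    # transition probabilities paper -> author
random_walk = U @ V

# Pairwise random walk: x and y each walk author -> paper and
# meet at the same middle object (a paper).
pairwise_random_walk = U @ U.T

print(path_count)
print(random_walk)
print(pairwise_random_walk)
```

Note that the random walk matrix is generally not symmetric, while the path count and the pairwise random walk are.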
In general, we can define a meta-path based
similarity framework for two objects x and y.
Note that P-PageRank and SimRank, two
well-known network similarity functions, are
weighted combinations of the random walk measure
and the pairwise random walk measure,
respectively, over meta-paths of different
lengths in homogeneous networks. To use
P-PageRank and SimRank in heterogeneous
information networks, the random walk has to be
restricted to the specified meta-paths.
a) A Novel Similarity Measure
The similarity measures presented so far are
biased toward either highly visible objects or
highly concentrated objects, but cannot capture
the semantics of peer similarity.
However, in many scenarios, finding similar
objects in networks means finding similar peers,
such as finding similar authors based on their
fields and reputation, finding similar actors
based on their movie styles and productivity, and
finding similar products.
This motivated us to propose a new meta-path
based similarity measure, called PathSim, that
captures the subtlety of peer similarity. The
insight behind it is that two similar peer
objects should not only be strongly connected,
but also share comparable visibility. As the peer
relation should be symmetric, we confine PathSim
to symmetric meta-paths. The calculation of
PathSim between any two objects of the same
type given a certain meta-path involves matrix
multiplication.
In this paper, we only consider meta-paths in
the round-trip form, to guarantee their symmetry
and therefore the symmetry of the PathSim
measure.
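For a round-trip meta-path whose first half has relation matrix W, PathSim as defined in the literature normalizes the path count between x and y by the path counts from each object to itself: s(x, y) = 2 M[x, y] / (M[x, x] + M[y, y]) with M = W W^T. The sketch below assumes this standard definition and a hypothetical author-venue count matrix; it is an illustration rather than the implementation used in the experiments.

```python
import numpy as np

def pathsim_matrix(W):
    """PathSim for all pairs under the round-trip meta-path built from W."""
    M = W @ W.T                              # path counts (commuting matrix)
    diag = np.diag(M)                        # path counts from each object to itself
    denom = diag[:, None] + diag[None, :]
    return np.divide(2.0 * M, denom, out=np.zeros_like(M, dtype=float),
                     where=denom > 0)

# Hypothetical author-venue publication counts (3 authors x 2 venues).
# Authors 0 and 1 have comparable visibility; author 2 is far more visible.
W = np.array([[2, 1],
              [4, 2],
              [20, 10]], dtype=float)

M = W @ W.T
S = pathsim_matrix(W)
print("path count 0-1:", M[0, 1], " PathSim 0-1:", S[0, 1])
print("path count 0-2:", M[0, 2], " PathSim 0-2:", S[0, 2])
```

In this example the path count ranks the highly visible author 2 above author 1 for query author 0, while PathSim prefers author 1, whose visibility is comparable to that of the query.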
Properties of PathSim
1. Symmetric.
2. Self-maximum
3. Balance of visibility
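The three properties can be checked directly from the standard PathSim definition. The derivation below is a sketch under that assumption: M = W W^T is the commuting matrix of the round-trip meta-path, w_x is the row of W for object x, and the notation is ours.

```latex
% Definition (assumed): PathSim under a round-trip meta-path with M = W W^{T}
s(x,y) = \frac{2\,M_{xy}}{M_{xx} + M_{yy}}, \qquad M_{xy} = w_x \cdot w_y .

% 1. Symmetric: M is symmetric, hence
s(x,y) = s(y,x).

% 2. Self-maximum: by the Cauchy-Schwarz and AM-GM inequalities
M_{xy} \le \lVert w_x \rVert \, \lVert w_y \rVert
       \le \tfrac{1}{2}\left( \lVert w_x \rVert^{2} + \lVert w_y \rVert^{2} \right)
       = \tfrac{1}{2}\left( M_{xx} + M_{yy} \right),
\quad\text{so}\quad 0 \le s(x,y) \le 1 = s(x,x).

% 3. Balance of visibility: using M_{xy} \le \sqrt{M_{xx} M_{yy}}
s(x,y) \le \frac{2\sqrt{M_{xx} M_{yy}}}{M_{xx} + M_{yy}}
       = \frac{2}{\sqrt{M_{xx}/M_{yy}} + \sqrt{M_{yy}/M_{xx}}} .
```

The last bound equals 1 only when M_xx = M_yy and decreases as the visibilities of the two objects diverge, which is why PathSim favors peers.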
Using meta-path based similarity, we can define
the similarity between two objects given any
round-trip meta-path, including arbitrarily long
ones.
As primary eigenvectors can be used as
authority rankings of objects, the similarity
between two objects under an infinite meta-path
can be viewed as a measure defined on their
rankings. Two objects with more similar
ranking scores will have higher similarity. In
the next section we discuss online query
processing for a single meta-path.
IV. QUERY PROCESSING FOR SINGLE META-
PATH
Compared with P-PageRank and SimRank, the
calculation of PathSim is much more efficient,
as it is a local graph measure. However, it still
involves expensive matrix multiplication
operations for top-k search, as we need to
calculate the similarity between a query and
every object of the same type in the network.
One possible solution is to materialize all the
meta-paths.
In order to support fast online query processing
for large-scale networks, we propose a
methodology that partially materializes short
meta-paths and then concatenates them
online to derive longer meta-path-based
similarity. First, a baseline method is proposed,
which computes the similarity between the query
object x and all the candidate objects y of the
same type. Next, a co-clustering based pruning
method is proposed, which prunes candidate
objects that are not promising according to their
similarity upper bounds. Both algorithms return
exact top-k results for the given query.
a) Baseline
Suppose we know the relation matrix for the
meta-path and the diagonal vector. To get the
top-k objects with the highest similarity to the
query, we need to compute the similarity of the
query to the candidate objects. The
straightforward baseline is: (1) apply
vector-matrix multiplication, (2) calculate the
similarity of each object, and (3) sort the
objects by similarity and return the top-k list.
When the matrix is large, the vector-matrix
computation is too time consuming if every
possible object is checked. Restricting the
computation to candidate objects that are
connected to the query is much more efficient
than pairwise computation between the query and
all the objects of that type. We call this
baseline concatenation algorithm
PathSim-baseline.
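The three baseline steps can be sketched as follows. This is a minimal illustration under assumptions: W stands for the materialized relation matrix of the half meta-path, diag for the precomputed diagonal vector of W W^T, and the function name pathsim_baseline is hypothetical.

```python
import numpy as np

def pathsim_baseline(W, diag, x, k):
    """Return the k objects most similar to query row x as (object, PathSim)."""
    q = W[x]
    # (1) vector-matrix multiplication: path counts from x to every object
    counts = W @ q
    # Only objects sharing at least one feature with the query are candidates.
    candidates = np.flatnonzero(counts)
    # (2) PathSim of each candidate, using the precomputed diagonal
    sims = 2.0 * counts[candidates] / (diag[x] + diag[candidates])
    # (3) sort by similarity and keep the top-k
    order = np.argsort(-sims, kind="stable")[:k]
    return [(int(candidates[i]), float(sims[i])) for i in order]

# Hypothetical relation matrix (4 target objects x 2 features).
W = np.array([[2, 1],
              [4, 2],
              [20, 10],
              [0, 0]], dtype=float)
diag = (W * W).sum(axis=1)          # diagonal of W @ W.T, computed offline

top2 = pathsim_baseline(W, diag, x=0, k=2)
print(top2)
```

Object 3 has no feature in common with the query, so it never enters the candidate set and is not scored.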
The PathSim-baseline algorithm is still time
consuming if the candidate set is large. The time
complexity of computing PathSim for each
candidate is O(d) on average and O(m) in
the worst case, where m is the vector length and
d is the average number of non-zero entries per
vector. We now propose a co-clustering
based top-k concatenation algorithm, by which
non-promising target objects are dynamically
filtered out to reduce the search space.
b) Co-Clustering-Based Pruning
In the baseline algorithm, the computational
cost involves two factors. First, the more
candidates to check, the more time the algorithm
will take; second, for each candidate, the dot
product of the query vector and the candidate
vector will involve at most m operations, where
m is the vector length. Based on this intuition,
we propose a co-clustering-based path
concatenation method, which first generates
co-clusters of two types of objects for the
partial relation matrix, then stores the
necessary statistics for each of the blocks
corresponding to different co-cluster pairs, and
then uses the block statistics to prune the
search space. For clarity, we call clusters of
the first type target clusters, since their
objects are the targets of the query, and
clusters of the second type feature clusters,
since their objects serve as features for
calculating the similarity between the query and
the target objects. By partitioning the target
objects into different target clusters, if a
whole target cluster is not similar to
the query, then all the objects in the target
cluster are unlikely to be in the final top-k
list and can be pruned. By partitioning the
feature objects into different feature clusters,
cheaper calculations on the dimension-reduced
query vector and candidate vectors can be used to
derive the similarity upper bounds.
PathSim-pruning can significantly improve the
query processing speed compared with the
baseline algorithm, without affecting the search
quality.
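The pruning idea can be sketched as follows. This is a simplified illustration and not the exact block statistics of the paper: the cluster assignments are assumed to be given (any partition keeps the bound valid, so the co-clustering step itself is omitted), the stored statistics are assumed to be the maximum entry of each block and the minimum diagonal value of each target cluster, and all names are hypothetical.

```python
import heapq
import numpy as np

def pathsim_pruned(W, target_cluster, feature_cluster, x, k):
    """Exact top-k PathSim for query row x, skipping target clusters whose
    upper bound is below the current k-th best similarity.
    Returns (list of (object, similarity), number of pruned clusters)."""
    diag = (W * W).sum(axis=1)
    n_t, n_f = int(target_cluster.max()) + 1, int(feature_cluster.max()) + 1

    # Block statistics (would be computed offline in practice).
    block_max = np.zeros((n_t, n_f))
    min_diag = np.full(n_t, np.inf)
    for t in range(n_t):
        rows = W[target_cluster == t]
        min_diag[t] = diag[target_cluster == t].min(initial=np.inf)
        for f in range(n_f):
            block_max[t, f] = rows[:, feature_cluster == f].max(initial=0.0)

    # Dimension-reduced query: one summed weight per feature cluster.
    q = W[x]
    q_sum = np.array([q[feature_cluster == f].sum() for f in range(n_f)])

    # For non-negative W: M[x, y] <= sum_f q_sum[f] * block_max[t, f] and
    # M[y, y] >= min_diag[t] for every y in target cluster t.
    count_ub = block_max @ q_sum
    denom = diag[x] + min_diag
    ub = np.divide(2.0 * count_ub, denom, out=np.zeros(n_t), where=denom > 0)

    top, pruned = [], 0                      # min-heap of (similarity, object)
    for t in np.argsort(-ub):
        if len(top) == k and ub[t] < top[0][0]:
            pruned += 1                      # no object in t can enter the top-k
            continue
        for y in np.flatnonzero(target_cluster == t):
            m_xy = float(W[y] @ q)
            if m_xy == 0.0:
                continue
            s = 2.0 * m_xy / float(diag[x] + diag[y])
            if len(top) < k:
                heapq.heappush(top, (s, int(y)))
            elif s > top[0][0]:
                heapq.heappushpop(top, (s, int(y)))
    ranked = sorted(top, key=lambda p: (-p[0], p[1]))
    return [(y, s) for s, y in ranked], pruned

# Hypothetical relation matrix: 6 target objects x 4 features, 2 x 2 co-clusters.
W = np.array([[2, 1, 0, 0],
              [4, 2, 0, 0],
              [3, 1, 0, 0],
              [0, 0, 5, 1],
              [0, 0, 2, 2],
              [0, 1, 9, 9]], dtype=float)
target_cluster = np.array([0, 0, 0, 1, 1, 1])
feature_cluster = np.array([0, 0, 1, 1])

result, pruned = pathsim_pruned(W, target_cluster, feature_cluster, x=0, k=2)
print(result, "pruned clusters:", pruned)
```

Because a cluster is skipped only when its upper bound is strictly below the current k-th best score, the returned scores are the same as those of the baseline; the second target cluster in the example is skipped without scoring any of its objects.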
c) Multiple Meta-Paths Combination
In the previous section, we presented algorithms
for similarity search using a single meta-path.
Now, we present a solution to combine multiple
meta-paths. The reason why we need to combine
several meta-paths is that each meta-path
provides a unique angle from which to view the
similarity between objects, and the ground truth
may be the result of different factors. Some
useful guidance for the weight assignment
includes: longer meta-paths utilize more remote
relationships and thus should be assigned a
smaller weight, as in P-PageRank and SimRank, and
meta-paths with more important relationships
should be assigned a higher weight. For
automatically determining the weights, users
could provide training examples of similar
objects to learn the weights of different
meta-paths using a learning algorithm.
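A weighted combination of several meta-paths can be sketched as below. The matrices and the weights are hypothetical and chosen by hand (the shorter meta-path gets the larger weight, following the guidance above); learning the weights from training examples is not shown.

```python
import numpy as np

def pathsim(W):
    """PathSim for the round-trip meta-path whose first half has matrix W."""
    M = W @ W.T
    d = np.diag(M)
    denom = d[:, None] + d[None, :]
    return np.divide(2.0 * M, denom, out=np.zeros_like(M, dtype=float),
                     where=denom > 0)

def combined_similarity(half_path_matrices, weights):
    """Weighted sum of PathSim over several meta-paths (weights normalized to 1)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * pathsim(Wi) for wi, Wi in zip(w, half_path_matrices))

# Hypothetical data: 3 authors, 3 papers, 2 venues.
W_ap = np.array([[1, 1, 0],
                 [0, 1, 1],
                 [0, 0, 1]], dtype=float)   # author -> paper
W_pv = np.array([[1, 0],
                 [1, 0],
                 [0, 1]], dtype=float)      # paper -> venue
W_av = W_ap @ W_pv                           # author -> paper -> venue

# author-paper-author gets weight 0.7, the longer
# author-paper-venue-paper-author gets weight 0.3.
S = combined_similarity([W_ap, W_av], [0.7, 0.3])
print(S)
```

Since each PathSim matrix is symmetric with a unit diagonal for objects that have at least one path instance, a convex combination keeps both properties.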
V. EXPECTED RESULTS
To show the effectiveness of the PathSim
measure and the efficiency of the proposed
algorithms, we use networks extracted from DBLP
and Flickr in the experiments.
The PathSim-pruning algorithm significantly
improves the query processing speed compared
with the baseline algorithm, without affecting
the search quality.
For additional case studies, we construct a
Flickr network from a subset of the Flickr data,
which contains four types of objects: images,
users, tags, and groups. We expect to show that
our algorithms improve similarity search between
objects based on the potentiality of and
correlation between objects.
VI. CONCLUSION
In this paper we introduced a novel meta-path
based similarity search approach, together with
a baseline algorithm and a co-clustering based
pruning algorithm, to improve similarity search
based on the strengths and relationships
between objects.
REFERENCES
[1] Jiawei Han, Lise Getoor, Wei Wang,
Johannes Gehrke, Robert Grossman "Mining
Heterogeneous Information Networks
Principles and Methodologies"
[2] Y. Koren, S.C. North, and C. Volinsky,
“Measuring and Extracting Proximity in
Networks,” Proc. 12th ACM SIGKDD Int’l
Conf. Knowledge Discovery and Data
Mining, pp. 245-255, 2006.
[3] M. Ito, K. Nakayama, T. Hara, and S. Nishio,
“Association Thesaurus Construction
Methods Based on Link Co-Occurrence
321
INTERNATIONAL CONFERENCE ON CURRENT INNOVATIONS IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL ASSOCIATION OF ENGINEERING & TECHNOLOGY FOR SKILL DEVELOPMENT
ISBN: 378 - 26 - 138420 - 5
www.iaetsd.in
Analysis for Wikipedia,” Proc. 17th ACM
Conf. Information and Knowledge
Management (CIKM), pp. 817-826, 2008.
[4] K. Nakayama, T. Hara, and S. Nishio,
“Wikipedia Mining for an Association Web
Thesaurus Construction,” Proc. Eighth Int’l
Conf. Web Information Systems Eng.
(WISE), pp. 322-334, 2007.
[5] M. Yazdani and A. Popescu-Belis, “A
Random Walk Framework to Compute
Textual Semantic Similarity: A Unified
Model for Three Benchmark Tasks,” Proc.
IEEE Fourth Int’l Conf. Semantic
Computing (ICSC), pp. 424-429, 2010.
[6] R.L. Cilibrasi and P.M.B. Vita´nyi, “The
Google Similarity Distance,” IEEE Trans.
Knowledge and Data Eng., vol. 19, no. 3,
pp. 370-383, Mar. 2007.
[7] G. Kasneci, F.M. Suchanek, G. Ifrim, M.
Ramanath, and G. Weikum, “Naga:
Searching and Ranking Knowledge,” Proc.
IEEE 24th Int’l Conf. Data Eng. (ICDE), pp.
953-962, 2008.
[8] R.K. Ahuja, T.L. Magnanti, and J.B. Orlin,
Network Flows: Theory, Algorithms, and
Applications. Prentice Hall, 1993.
322
INTERNATIONAL CONFERENCE ON CURRENT INNOVATIONS IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL ASSOCIATION OF ENGINEERING & TECHNOLOGY FOR SKILL DEVELOPMENT
ISBN: 378 - 26 - 138420 - 5
www.iaetsd.in

More Related Content

PDF
Distributed Link Prediction in Large Scale Graphs using Apache Spark
DOCX
NE7012- SOCIAL NETWORK ANALYSIS
DOCX
On optimizing overlay topologies for search
PPTX
2015 07-tuto3-mining hin
PDF
Sub-Graph Finding Information over Nebula Networks
DOC
Keyword query routing
PDF
Context Sensitive Relatedness Measure of Word Pairs
PDF
SEMANTICS GRAPH MINING FOR TOPIC DISCOVERY AND WORD ASSOCIATIONS
Distributed Link Prediction in Large Scale Graphs using Apache Spark
NE7012- SOCIAL NETWORK ANALYSIS
On optimizing overlay topologies for search
2015 07-tuto3-mining hin
Sub-Graph Finding Information over Nebula Networks
Keyword query routing
Context Sensitive Relatedness Measure of Word Pairs
SEMANTICS GRAPH MINING FOR TOPIC DISCOVERY AND WORD ASSOCIATIONS

What's hot (17)

PDF
Survey on Location Based Recommendation System Using POI
DOC
Stars : a statistical traffic pattern discovery system for manets
PDF
Discovering latent informaion by
PDF
P2P DOMAIN CLASSIFICATION USING DECISION TREE
PDF
An Improved Web Explorer using Explicit Semantic Similarity with ontology and...
PDF
Paper id 25201463
PDF
Computing semantic similarity measure between words using web search engine
PDF
Volume 2-issue-6-2016-2020
PPTX
Data Mining: Graph mining and social network analysis
PDF
Analysing literature through the lens of information theory and network science
PDF
A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR...
PDF
Social Data Mining
DOC
Leveraging social networks for p2 p content based file sharing in disconnecte...
PDF
An Efficient Modified Common Neighbor Approach for Link Prediction in Social ...
PPTX
Data mining for social media
PDF
Predicting_new_friendships_in_social_networks
PDF
Protein-protein interactions-graph-theoretic-modeling
Survey on Location Based Recommendation System Using POI
Stars : a statistical traffic pattern discovery system for manets
Discovering latent informaion by
P2P DOMAIN CLASSIFICATION USING DECISION TREE
An Improved Web Explorer using Explicit Semantic Similarity with ontology and...
Paper id 25201463
Computing semantic similarity measure between words using web search engine
Volume 2-issue-6-2016-2020
Data Mining: Graph mining and social network analysis
Analysing literature through the lens of information theory and network science
A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR...
Social Data Mining
Leveraging social networks for p2 p content based file sharing in disconnecte...
An Efficient Modified Common Neighbor Approach for Link Prediction in Social ...
Data mining for social media
Predicting_new_friendships_in_social_networks
Protein-protein interactions-graph-theoretic-modeling
Ad

Viewers also liked (20)

PPTX
Stanford Law School Presentation on Innovations in Law Firm Model
PPTX
lukman nurhasin
PDF
Coverage report - YuMi launching ceremony
PDF
¡Siempre el mejor precio!
PPT
20101023 mind mapping
PDF
Tech Radar
PDF
BEEP Ofertas Marzo 2014
PPTX
Decisión del consumidor
PPT
Diseño de un Mezclador Basado en Convertidores de Corriente en Tecnología CMO...
PDF
Catálogo BEEP Complementos 2014
PPTX
PDF
Padrões de deploy para DevOps e Entrega Contínua, por Danilo Sato
PDF
Conduct JBoss EAP 6 seminar
PDF
DCN legal and policy presentation
PPT
I brand e i social media: case histories 2006-2013
PDF
Centro cultural
PDF
PDF
A wideband hybrid plasmonic fractal patch nanoantenn
PPTX
General election 2014 : Social Media campaigning on Facebook
Stanford Law School Presentation on Innovations in Law Firm Model
lukman nurhasin
Coverage report - YuMi launching ceremony
¡Siempre el mejor precio!
20101023 mind mapping
Tech Radar
BEEP Ofertas Marzo 2014
Decisión del consumidor
Diseño de un Mezclador Basado en Convertidores de Corriente en Tecnología CMO...
Catálogo BEEP Complementos 2014
Padrões de deploy para DevOps e Entrega Contínua, por Danilo Sato
Conduct JBoss EAP 6 seminar
DCN legal and policy presentation
I brand e i social media: case histories 2006-2013
Centro cultural
A wideband hybrid plasmonic fractal patch nanoantenn
General election 2014 : Social Media campaigning on Facebook
Ad

Similar to Iaetsd similarity search in information networks using (20)

PDF
A Survey On Link Prediction In Social Networks
PDF
G5234552
DOCX
2014 IEEE JAVA DATA MINING PROJECT Keyword query routing
DOCX
IEEE 2014 JAVA DATA MINING PROJECTS Keyword query routing
DOCX
JPJ1423 Keyword Query Routing
PDF
Based on the Influence Factors in the Heterogeneous Network t-path Similarity...
PDF
Ijetcas14 347
DOC
Poster Abstracts
DOCX
keyword query routing
PDF
Scale-Free Networks to Search in Unstructured Peer-To-Peer Networks
PDF
WEB PAGE RANKING BASED ON TEXT SUBSTANCE OF LINKED PAGES
DOCX
IEEE 2014 JAVA DATA MINING PROJECTS Multi comm finding community structure in...
DOCX
2014 IEEE JAVA DATA MINING PROJECT Multi comm finding community structure in ...
PDF
Bx32903907
PDF
Annotating Search Results from Web Databases
PDF
An Improved PageRank Algorithm for Multilayer Networks
DOCX
A generalized flow based method for analysis of implicit relationships on wik...
PDF
A Distributed Approach to Solving Overlay Mismatching Problem
PDF
SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF...
PDF
Object surface segmentation, Image segmentation, Region growing, X-Y-Z image,...
A Survey On Link Prediction In Social Networks
G5234552
2014 IEEE JAVA DATA MINING PROJECT Keyword query routing
IEEE 2014 JAVA DATA MINING PROJECTS Keyword query routing
JPJ1423 Keyword Query Routing
Based on the Influence Factors in the Heterogeneous Network t-path Similarity...
Ijetcas14 347
Poster Abstracts
keyword query routing
Scale-Free Networks to Search in Unstructured Peer-To-Peer Networks
WEB PAGE RANKING BASED ON TEXT SUBSTANCE OF LINKED PAGES
IEEE 2014 JAVA DATA MINING PROJECTS Multi comm finding community structure in...
2014 IEEE JAVA DATA MINING PROJECT Multi comm finding community structure in ...
Bx32903907
Annotating Search Results from Web Databases
An Improved PageRank Algorithm for Multilayer Networks
A generalized flow based method for analysis of implicit relationships on wik...
A Distributed Approach to Solving Overlay Mismatching Problem
SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF...
Object surface segmentation, Image segmentation, Region growing, X-Y-Z image,...

More from Iaetsd Iaetsd (20)

PDF
iaetsd Survey on cooperative relay based data transmission
PDF
iaetsd Software defined am transmitter using vhdl
PDF
iaetsd Health monitoring system with wireless alarm
PDF
iaetsd Equalizing channel and power based on cognitive radio system over mult...
PDF
iaetsd Economic analysis and re design of driver’s car seat
PDF
iaetsd Design of slotted microstrip patch antenna for wlan application
PDF
REVIEW PAPER- ON ENHANCEMENT OF HEAT TRANSFER USING RIBS
PDF
A HYBRID AC/DC SOLAR POWERED STANDALONE SYSTEM WITHOUT INVERTER BASED ON LOAD...
PDF
Fabrication of dual power bike
PDF
Blue brain technology
PDF
iirdem The Livable Planet – A Revolutionary Concept through Innovative Street...
PDF
iirdem Surveillance aided robotic bird
PDF
iirdem Growing India Time Monopoly – The Key to Initiate Long Term Rapid Growth
PDF
iirdem Design of Efficient Solar Energy Collector using MPPT Algorithm
PDF
iirdem CRASH IMPACT ATTENUATOR (CIA) FOR AUTOMOBILES WITH THE ADVOCATION OF M...
PDF
iirdem ADVANCING OF POWER MANAGEMENT IN HOME WITH SMART GRID TECHNOLOGY AND S...
PDF
iaetsd Shared authority based privacy preserving protocol
PDF
iaetsd Secured multiple keyword ranked search over encrypted databases
PDF
iaetsd Robots in oil and gas refineries
PDF
iaetsd Modeling of solar steam engine system using parabolic
iaetsd Survey on cooperative relay based data transmission
iaetsd Software defined am transmitter using vhdl
iaetsd Health monitoring system with wireless alarm
iaetsd Equalizing channel and power based on cognitive radio system over mult...
iaetsd Economic analysis and re design of driver’s car seat
iaetsd Design of slotted microstrip patch antenna for wlan application
REVIEW PAPER- ON ENHANCEMENT OF HEAT TRANSFER USING RIBS
A HYBRID AC/DC SOLAR POWERED STANDALONE SYSTEM WITHOUT INVERTER BASED ON LOAD...
Fabrication of dual power bike
Blue brain technology
iirdem The Livable Planet – A Revolutionary Concept through Innovative Street...
iirdem Surveillance aided robotic bird
iirdem Growing India Time Monopoly – The Key to Initiate Long Term Rapid Growth
iirdem Design of Efficient Solar Energy Collector using MPPT Algorithm
iirdem CRASH IMPACT ATTENUATOR (CIA) FOR AUTOMOBILES WITH THE ADVOCATION OF M...
iirdem ADVANCING OF POWER MANAGEMENT IN HOME WITH SMART GRID TECHNOLOGY AND S...
iaetsd Shared authority based privacy preserving protocol
iaetsd Secured multiple keyword ranked search over encrypted databases
iaetsd Robots in oil and gas refineries
iaetsd Modeling of solar steam engine system using parabolic

Recently uploaded (20)

PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
PDF
Operating System & Kernel Study Guide-1 - converted.pdf
PPTX
Sustainable Sites - Green Building Construction
PDF
Well-logging-methods_new................
PPTX
UNIT 4 Total Quality Management .pptx
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
PPTX
additive manufacturing of ss316l using mig welding
PDF
R24 SURVEYING LAB MANUAL for civil enggi
PDF
PPT on Performance Review to get promotions
PPTX
Geodesy 1.pptx...............................................
DOCX
573137875-Attendance-Management-System-original
PPTX
Construction Project Organization Group 2.pptx
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PPTX
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PPTX
CH1 Production IntroductoryConcepts.pptx
PPT
Mechanical Engineering MATERIALS Selection
PPTX
Internet of Things (IOT) - A guide to understanding
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PDF
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
Operating System & Kernel Study Guide-1 - converted.pdf
Sustainable Sites - Green Building Construction
Well-logging-methods_new................
UNIT 4 Total Quality Management .pptx
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
additive manufacturing of ss316l using mig welding
R24 SURVEYING LAB MANUAL for civil enggi
PPT on Performance Review to get promotions
Geodesy 1.pptx...............................................
573137875-Attendance-Management-System-original
Construction Project Organization Group 2.pptx
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
CH1 Production IntroductoryConcepts.pptx
Mechanical Engineering MATERIALS Selection
Internet of Things (IOT) - A guide to understanding
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...

Iaetsd similarity search in information networks using

  • 1. Similarity Search in Information Networks using Meta-Path Based between Objects Abstract – Real world physical and abstract data objects are interconnected, forming enormous, interconnected networks. By structuring these data objects and interactions between these objects into multiple types, such networks become semi-structured heterogeneous information networks. Therefore, the quality analysis of large heterogeneous information networks poses new challenges. In current system, a generalized flow based method is introduced for measuring the relationship on Wikipedia by reflecting all three concepts: distance, connectivity and co-citation. By using the current approach we measure only relationships between objects rather than similarities. To address these problems we introduce a novel solution meta-path based similarity searching approach for dealing with heterogeneous information networks using a meta-path-based method. Under this framework, similarity search and other mining tasks of the network structure. Index terms – similarity search, information network, and meta-path based, clustering. I. INTRODUCTION Similarity search, which aims at locating the most relevant information for a query in large collections of datasets, has been widely studied in many applications. For example, in spatial database, people are interested in finding the k nearest neighbors for a given spatial object. Object similarity is also one of the most primitive concepts for object clustering and many other data mining functions. In a similar context, it is critical to provide effective similarity search functions in information networks, to find similar entities for a given entity. In a network of tagged images such as flicker, a user may be interested in search for the most similar pictures for a given picture. In an e-commerce system, a user would be interest in search for the most similar products for a given product. 
Different attribute- based similarity search, links play an important role for similarity search in information networks, especially when the full information about attributes for objects is difficult to obtain. There are a few studies leveraging link information in networks for similarity search, but most of these revisions are focused on homogeneous or bipartite networks such as P- PageRank and SimRank. These similarity measures disregard the subtlety of different types among objects and links. Adoption of such measures to heterogeneous networks his significant drawbacks: even if we just want to compare objects of the same type, going through link paths of different types leads to rather 317 INTERNATIONAL CONFERENCE ON CURRENT INNOVATIONS IN ENGINEERING AND TECHNOLOGY INTERNATIONAL ASSOCIATION OF ENGINEERING & TECHNOLOGY FOR SKILL DEVELOPMENT ISBN: 378 - 26 - 138420 - 5 www.iaetsd.in
  • 2. different semantics meanings, and it makes little sense to mix them up and measure the similarity without distinguishing their semantics. To systematically distinguish the semantics among paths connecting two objects, we introduce a meta-path based similarity framework for objects of the same type in a heterogeneous network. A meta-path is a sequence of relations between object types, which defines a new composite relation between its starting type and ending type. The meta-path framework provides a powerful mechanism for a user to select appropriate similarity semantics, by choosing a proper meta-path, or learn it from a set of training examples of similar objects. The meta-path based similarity framework, and relate it to two well-known existing link-based similarity functions for homogeneous information networks. We define a novel similarity measure, PathSim that is able to find peer objects that are not only strongly connected with each other but also share similar visibility in the network. Moreover, we propose an efficient algorithm to support online top-k queries for such similarity search. II. A META-PATH BASED SIMILARITY MEASURE The similarity between two objects in a link- based similarity function is determined by how the objects are connected in a network, which can be described using paths. In a heterogeneous information network, due to the heterogeneity of the types of links, the way to connect two objects can be much more diverse. The schema connections represent different relationships between authors, each having some different semantic meaning. Now the questions are, given an arbitrary heterogeneous information network, is there any way systematically identify all the possible connection type between two objects types? In order to do so, we propose two important concepts in the following. 
a) Network Schema And Meta-Path First, given a complex heterogeneous information network, it is necessary to provide its Meta level description for better understanding the network. Therefore, we propose the concept of network scheme to describe the Meta structure of a network. The concept of network scheme is similar to that of the Entity – Relationship model in database systems, but only captures the entity type and their binary relations, without considering the attributes for each Entity type. Network schema serves as a template for a network, and tells how many types of objects there are in the network and where the possible links exist. b) Bibliographic Scheme and Meta-Path For the bibliographic network scheme, where an explicitly shows the direction of a relation. III. META-PATH BASED SIMILARITY FRAMEWORK Given a user-specified meta-path, several similarity measures can be defined for a pair of objects, and according to the path instances 318 INTERNATIONAL CONFERENCE ON CURRENT INNOVATIONS IN ENGINEERING AND TECHNOLOGY INTERNATIONAL ASSOCIATION OF ENGINEERING & TECHNOLOGY FOR SKILL DEVELOPMENT ISBN: 378 - 26 - 138420 - 5 www.iaetsd.in
  • 3. between them following the meta-path. Several straightforward measures are listed below.
Path count: the number of path instances between objects x and y.
Random walk: s(x, y) is the probability that a random walk starting from x ends at y following meta-path P, which is the sum of the probabilities of all the path instances.
Pairwise random walk: for a meta-path P that can be decomposed into two shorter meta-paths of the same length, P = (P1 P2), s(x, y) is the pairwise random walk probability of starting from objects x and y and reaching the same middle object.
In general, we can define a meta-path based similarity framework for two objects x and y. Note that P-PageRank and SimRank, two well-known network similarity functions, are weighted combinations of the random walk measure and the pairwise random walk measure, respectively, over meta-paths of different lengths in homogeneous networks. To use P-PageRank and SimRank in heterogeneous information networks, the random walk has to be restricted to the given meta-path.
a) A Novel Similarity Measure
The similarity measures presented so far are biased toward either highly visible objects or highly concentrated objects, and cannot capture the semantics of peer similarity. However, in many scenarios, finding similar objects in networks means finding similar peers, such as finding similar authors based on their fields and reputation, finding similar actors based on their movie styles and productivity, and finding similar products. This motivated us to propose a new meta-path based similarity measure, called PathSim, that captures the subtlety of peer similarity. The insight behind it is that two similar peer objects should not only be strongly connected, but also share comparable visibility. As the peer relation should be symmetric, we confine PathSim to symmetric meta-paths. The calculation of PathSim between any two objects of the same type given a certain meta-path involves matrix multiplication.
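The measures described above can be compared on a toy example. The sketch below assumes a round-trip meta-path author-venue-author and a small, made-up author-by-venue count table `W`. The text does not spell out the PathSim formula, so `path_sim` uses the definition commonly given for it in the literature (twice the path count between x and y, divided by the sum of their round-trip path counts); treat that formula, the data, and all function names as assumptions.

```python
# Hypothetical data: W[a][v] = number of papers author a published in venue v.
W = {
    "Ann":   {"KDD": 2, "VLDB": 1},
    "Bob":   {"KDD": 50, "VLDB": 20},
    "Carol": {"KDD": 2, "VLDB": 1},
}

def path_count(W, x, y):
    """Number of author-venue-author path instances between x and y,
    i.e. entry (x, y) of the matrix product W * W^T."""
    return sum(n * W[y].get(v, 0) for v, n in W[x].items())

def random_walk(W, x, y):
    """Probability that a two-step walk author -> venue -> author that
    starts at x ends at y, choosing each link in proportion to its count."""
    out_x = sum(W[x].values())
    prob = 0.0
    for v, n in W[x].items():
        if n == 0:
            continue
        in_v = sum(W[z].get(v, 0) for z in W)  # total links into venue v
        prob += (n / out_x) * (W[y].get(v, 0) / in_v)
    return prob

def path_sim(W, x, y):
    """PathSim under the assumed definition: 2 * M(x, y) / (M(x, x) + M(y, y))."""
    denom = path_count(W, x, x) + path_count(W, y, y)
    return 2 * path_count(W, x, y) / denom if denom else 0.0

for other in ("Bob", "Carol"):
    print(other, path_count(W, "Ann", other),
          round(random_walk(W, "Ann", other), 3),
          round(path_sim(W, "Ann", other), 3))
```

In this toy table Ann and Carol have identical publication profiles, while Bob publishes in the same venues at a far larger volume. Path count and random walk both rank Bob above Carol for the query Ann, whereas the assumed PathSim formula ranks Carol first, which is the peer-similarity behaviour the text describes.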
In this paper, we only consider meta-paths in the round-trip form, that is, a meta-path followed by its own reverse, to guarantee the symmetry of the meta-path and therefore the symmetry of the PathSim measure.
Properties of PathSim
1. Symmetric
2. Self-maximum
3. Balance of visibility
Using meta-path based similarity, we can define the similarity between two objects given any round-trip meta-path, including arbitrarily long ones. As primary eigenvectors can be used as an authority ranking of objects, the similarity between two objects under an infinite meta-path can be viewed as a measure defined on their rankings: two objects with more similar ranking scores will have higher similarity. In the next section we discuss online query processing for a single meta-path.
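The three properties can be stated compactly. The text above does not give the formula, so the statement below assumes the definition of PathSim commonly given in the literature. The symbols are introduced here for illustration only: W is the matrix of path counts for the first half of the round-trip meta-path, and M = W Wᵀ holds the round-trip path counts between objects of the same type.

```latex
s(x_i, x_j) = \frac{2\,M_{ij}}{M_{ii} + M_{jj}}, \qquad M = W W^{\top}

\text{1. Symmetric: } s(x_i, x_j) = s(x_j, x_i), \text{ since } M_{ij} = M_{ji}.

\text{2. Self-maximum: } 0 \le s(x_i, x_j) \le 1 \text{ and } s(x_i, x_i) = 1,
\text{ since } 2 M_{ij} \le M_{ii} + M_{jj}.

\text{3. Balance of visibility: }
s(x_i, x_j) \le \frac{2}{\sqrt{M_{ii}/M_{jj}} + \sqrt{M_{jj}/M_{ii}}},
\text{ since } M_{ij} \le \sqrt{M_{ii}\,M_{jj}}.
```

Under this assumed definition, the bound in property 3 is close to 1 only when M_ii and M_jj are of similar size, which is one way to read the requirement that similar peers share comparable visibility.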
  • 4. IV. QUERY PROCESSING FOR SINGLE META-PATH
Compared with P-PageRank and SimRank, the calculation of PathSim is much more efficient, as it is a local graph measure. However, it still involves expensive matrix multiplication operations for top-k search, as we need to calculate the similarity between a query and every object of the same type in the network. One possible solution is to materialize all the meta-paths, but this is impractical for large-scale networks. To support fast online query processing for such networks, we propose a methodology that partially materializes short meta-paths and then concatenates them online to derive longer meta-path based similarities. First, a baseline method is proposed, which computes the similarity between the query object x and all the candidate objects y of the same type. Next, a co-clustering based pruning method is proposed, which prunes candidate objects that are not promising according to their similarity upper bounds. Both algorithms return exact top-k results for the given query.
a) Baseline
Suppose we know the relation matrix for the meta-path and the diagonal vector of the resulting round-trip relation. To get the top-k objects with the highest similarity for the query, we need to compute the similarity between the query and each candidate object. The straightforward baseline is: (1) apply vector-matrix multiplication to obtain the path counts between the query and all objects; (2) calculate the similarity of each object to the query; (3) sort the objects by similarity and return the top-k list. When the relation matrix is large, the vector-matrix computation is too time consuming to check every possible object. Instead, only the objects that share at least one neighbor with the query need to be checked, which is much more efficient than pairwise computation between the query and all the objects of that type. We call this baseline concatenation algorithm PathSim-baseline.
The PathSim-baseline algorithm is still time consuming if the candidate set is large. The time complexity of computing PathSim for each candidate is O(d) on average and O(m) in the worst case, where d is the average number of non-zero entries in an object's vector and m is the vector length.
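The three baseline steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `W` is a made-up sparse half-path matrix stored as nested dictionaries, `diag` is the precomputed diagonal of the round-trip relation, the score uses the PathSim formula commonly given in the literature, and the sketch only visits candidates that share a middle object with the query. The function names and the choice to leave the query itself out of the result are also assumptions.

```python
import heapq
from collections import defaultdict

def build_inverted_index(W):
    """Map each middle-type object to the target objects linked to it."""
    index = defaultdict(dict)
    for x, row in W.items():
        for mid, n in row.items():
            index[mid][x] = n
    return index

def pathsim_baseline(W, index, diag, query, k):
    """Return the k objects most similar to `query` as (object, score) pairs."""
    # Step 1: sparse vector-matrix product M(query, :) = W(query, :) * W^T.
    # Only objects that share a middle object with the query are touched.
    counts = defaultdict(int)
    for mid, n in W[query].items():
        for y, n_y in index[mid].items():
            counts[y] += n * n_y
    # Step 2: similarity of every candidate (assumed PathSim formula).
    scores = {
        y: 2 * c / (diag[query] + diag[y])
        for y, c in counts.items()
        if y != query  # the query itself is left out of the result
    }
    # Step 3: keep the k highest-scoring candidates, best first.
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Hypothetical author-by-venue path counts.
W = {
    "Ann":   {"KDD": 2, "VLDB": 1},
    "Bob":   {"KDD": 50, "VLDB": 20},
    "Carol": {"KDD": 2, "VLDB": 1},
    "Dave":  {"ICML": 3},
}
index = build_inverted_index(W)
diag = {x: sum(n * n for n in row.values()) for x, row in W.items()}
top = pathsim_baseline(W, index, diag, "Ann", k=2)
print(top)
```

Dave shares no venue with Ann, so he is never visited in step 1. The work per candidate is proportional to the number of middle objects it shares with the query, which is where the O(d) average cost per candidate comes from.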
We now propose a co-clustering based top-k concatenation algorithm, by which non-promising target objects are dynamically filtered out to reduce the search space.
b) Co-Clustering-Based Pruning
In the baseline algorithm, the computational cost depends on two factors. First, the more candidates there are to check, the more time the algorithm takes; second, for each candidate, the dot product of the query vector and the candidate vector involves at most m operations, where m is the vector length. Based on this intuition, we propose a co-clustering-based path concatenation method, which first generates co-clusters of the two types of objects for the partial relation matrix, then stores the necessary statistics for each of the blocks corresponding to different co-cluster pairs, and then uses the block statistics to prune the search space. For clarity, we call the clusters of the target type target clusters, since those objects are the targets of the query, and the clusters of the other type feature clusters, since those objects serve as features for calculating the similarity between the query and the target objects. By partitioning the target objects into different target clusters, if a whole target cluster is not similar to
  • 5. the query, then all the objects in the target cluster are unlikely to be in the final top-k list and can be pruned. By partitioning the feature objects into different feature clusters, cheaper calculations on the dimension-reduced query vector and candidate vectors can be used to derive the similarity upper bounds. PathSim-pruning can significantly improve the query processing speed compared with the baseline algorithm, without affecting the search quality.
c) Multiple Meta-Paths Combination
In the previous section, we presented algorithms for similarity search using a single meta-path. Now, we present a solution for combining multiple meta-paths. The reason for combining several meta-paths is that each meta-path provides a unique angle from which to view the similarity between objects, and the ground truth may result from a combination of different factors. Some useful guidance for the weight assignment includes: longer meta-paths use more remote relationships and thus should be assigned smaller weights, as in P-PageRank and SimRank; and meta-paths with more important relationships should be assigned higher weights. To determine the weights automatically, users could provide training examples of similar objects, from which a learning algorithm learns the weights of the different meta-paths.
V. EXPECTED RESULTS
To show the effectiveness of the PathSim measure and the efficiency of the proposed algorithms, we use the bibliographic networks extracted from DBLP and Flickr in the experiments. We expect the PathSim-pruning algorithm to significantly improve the query processing speed compared with the baseline algorithm, without affecting the search quality. For additional case studies, we construct a Flickr network from a subset of the Flickr data, which contains four types of objects: images, users, tags, and groups. We aim to show that our algorithms improve similarity search between objects based on the strength of the connections and the correlation between objects.
VI.
CONCLUSION
In this paper we introduced a novel meta-path based similarity search approach, together with a baseline algorithm and a co-clustering based pruning algorithm, to improve similarity search based on the strengths of, and relationships between, objects.
REFERENCES
[1] Jiawei Han, Lise Getoor, Wei Wang, Johannes Gehrke, Robert Grossman, “Mining Heterogeneous Information Networks: Principles and Methodologies.”
[2] Y. Koren, S.C. North, and C. Volinsky, “Measuring and Extracting Proximity in Networks,” Proc. 12th ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, pp. 245-255, 2006.
[3] M. Ito, K. Nakayama, T. Hara, and S. Nishio, “Association Thesaurus Construction Methods Based on Link Co-Occurrence
  • 6. Analysis for Wikipedia,” Proc. 17th ACM Conf. Information and Knowledge Management (CIKM), pp. 817-826, 2008.
[4] K. Nakayama, T. Hara, and S. Nishio, “Wikipedia Mining for an Association Web Thesaurus Construction,” Proc. Eighth Int’l Conf. Web Information Systems Eng. (WISE), pp. 322-334, 2007.
[5] M. Yazdani and A. Popescu-Belis, “A Random Walk Framework to Compute Textual Semantic Similarity: A Unified Model for Three Benchmark Tasks,” Proc. IEEE Fourth Int’l Conf. Semantic Computing (ICSC), pp. 424-429, 2010.
[6] R.L. Cilibrasi and P.M.B. Vitányi, “The Google Similarity Distance,” IEEE Trans. Knowledge and Data Eng., vol. 19, no. 3, pp. 370-383, Mar. 2007.
[7] G. Kasneci, F.M. Suchanek, G. Ifrim, M. Ramanath, and G. Weikum, “Naga: Searching and Ranking Knowledge,” Proc. IEEE 24th Int’l Conf. Data Eng. (ICDE), pp. 953-962, 2008.
[8] R.K. Ahuja, T.L. Magnanti, and J.B. Orlin, Network Flows: Theory, Algorithms, and Applications. Prentice Hall, 1993.