Similarity Search in Information Networks using
Meta-Path Based between Objects
Abstract – Real-world physical and abstract data
objects are interconnected, forming enormous
networks. By structuring these data objects and
the interactions between them into multiple
types, such networks become semi-structured
heterogeneous information networks. The quality
analysis of large heterogeneous information
networks therefore poses new challenges. In the
current system, a generalized flow-based method
is used to measure relationships on Wikipedia by
reflecting all three concepts: distance,
connectivity and co-citation. This approach
measures only relationships between objects
rather than similarities. To address this
problem, we introduce a meta-path based
similarity search approach for heterogeneous
information networks. Under this framework,
similarity search and other mining tasks can be
carried out on the network structure.
Index terms – similarity search, information
network, meta-path, clustering.
I. INTRODUCTION
Similarity search, which aims at locating the
most relevant information for a query in large
collections of datasets, has been widely studied
in many applications. For example, in spatial
databases, people are interested in finding the k
nearest neighbors for a given spatial object.
Object similarity is also one of the most
primitive concepts for object clustering and
many other data mining functions.
In a similar context, it is critical to provide
effective similarity search functions in
information networks, to find similar entities for
a given entity. In a network of tagged images
such as Flickr, a user may be interested in
searching for the pictures most similar to a given
picture. In an e-commerce system, a user would
be interested in searching for the products most
similar to a given product. Different from
attribute-based similarity search, links play an
important role for similarity search in
information networks, especially when the full
information about attributes for objects is
difficult to obtain.
There are a few studies leveraging link
information in networks for similarity search,
but most of these studies focus on homogeneous
or bipartite networks, such as P-PageRank and
SimRank. These similarity measures disregard
the subtlety of different types among objects
and links. Adopting such measures in
heterogeneous networks has significant
drawbacks: even if we just want to compare
objects of the same type, going through link
paths of different types leads to rather
INTERNATIONAL CONFERENCE ON CURRENT INNOVATIONS IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL ASSOCIATION OF ENGINEERING & TECHNOLOGY FOR SKILL DEVELOPMENT
ISBN: 378 - 26 - 138420 - 5
www.iaetsd.in

different semantic meanings, and it makes little
sense to mix them up and measure the similarity
without distinguishing their semantics.
To systematically distinguish the semantics
among paths connecting two objects, we
introduce a meta-path based similarity
framework for objects of the same type in a
heterogeneous network. A meta-path is a
sequence of relations between object types,
which defines a new composite relation between
its starting type and ending type. The meta-path
framework provides a powerful mechanism for a
user to select appropriate similarity semantics,
by choosing a proper meta-path, or to learn it
from a set of training examples of similar objects.
We present the meta-path based similarity
framework and relate it to two well-known
existing link-based similarity functions for
homogeneous information networks. We define a
novel similarity measure, PathSim, that is able to find
peer objects that are not only strongly connected
with each other but also share similar visibility
in the network. Moreover, we propose an
efficient algorithm to support online top-k
queries for such similarity search.
II. A META-PATH BASED SIMILARITY
MEASURE
The similarity between two objects in a link-
based similarity function is determined by how
the objects are connected in a network, which
can be described using paths. In a heterogeneous
information network, due to the heterogeneity of
the types of links, the way to connect two
objects can be much more diverse. The schema
connections represent different relationships
between authors, each having a different
semantic meaning.
Now the question is: given an arbitrary
heterogeneous information network, is there any
way to systematically identify all the possible
connection types between two object types? In
order to do so, we propose two important
concepts in the following.
a) Network Schema and Meta-Path
First, given a complex heterogeneous
information network, it is necessary to provide
its meta-level description for a better
understanding of the network. Therefore, we
propose the concept of a network schema to
describe the meta structure of a network.
The concept of a network schema is similar to
that of the Entity-Relationship model in database
systems, but it captures only the entity types and
their binary relations, without considering the
attributes of each entity type. The network
schema serves as a template for a network, and
tells how many types of objects there are in the
network and where the possible links exist.
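As an illustration, a network schema can be held as a set of object types
plus the set of type pairs that may be linked, and a candidate meta-path can
then be checked against it. The sketch below is a minimal Python example; the
type names, the relation set, and the helper is_valid_meta_path are
assumptions chosen for a bibliographic setting, not definitions taken from
this paper.

```python
# Hypothetical bibliographic schema (illustrative assumption): the object
# types and the directed type pairs between which links may exist.
SCHEMA = {
    "object_types": {"Author", "Paper", "Venue", "Term"},
    "relations": {
        ("Author", "Paper"),  # writes
        ("Paper", "Author"),  # written by
        ("Paper", "Venue"),   # published in
        ("Venue", "Paper"),   # publishes
        ("Paper", "Term"),    # mentions
        ("Term", "Paper"),    # mentioned by
    },
}


def is_valid_meta_path(schema, type_sequence):
    """Return True if every consecutive type pair is a relation in the schema."""
    if len(type_sequence) < 2:
        return False
    if any(t not in schema["object_types"] for t in type_sequence):
        return False
    return all(
        (a, b) in schema["relations"]
        for a, b in zip(type_sequence, type_sequence[1:])
    )


# Author-Paper-Venue-Paper-Author is allowed; Author-Venue is not,
# because the assumed schema has no direct Author-Venue relation.
apvpa_ok = is_valid_meta_path(
    SCHEMA, ["Author", "Paper", "Venue", "Paper", "Author"]
)
av_ok = is_valid_meta_path(SCHEMA, ["Author", "Venue"])
```

Keeping the schema separate from the data means a user-supplied meta-path can
be rejected before any path instances are counted.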
b) Bibliographic Schema and Meta-Path
In the bibliographic network schema, an arrow
explicitly shows the direction of a relation.
III. META-PATH BASED SIMILARITY
FRAMEWORK
Given a user-specified meta-path, several
similarity measures can be defined for a pair of
objects according to the path instances
between them following the meta-path. Several
straightforward measures are listed below.
Path count: the number of path instances
between objects x and y following meta-path P.
Random walk: s(x, y) is the probability of the
random walk that starts from x and ends at y
following meta-path P, which is the sum of the
probabilities of all the path instances.
Pairwise random walk: for a meta-path P that
can be decomposed into two shorter meta-paths
of the same length, s(x, y) is the pairwise
random walk probability of starting from objects
x and y and reaching the same middle object.
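The three measures above can be sketched with plain matrix products. The
example below is a minimal illustration on an assumed toy network (3 authors,
4 papers, 2 venues) and the round-trip meta-path
Author-Paper-Venue-Paper-Author; the matrices W_AP and W_PV and all helper
names are hypothetical and not part of the paper.

```python
def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]


def row_normalize(a):
    """Turn link counts into transition probabilities, row by row."""
    out = []
    for row in a:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


# Assumed toy data: W_AP[a][p] = 1 if author a wrote paper p,
# W_PV[p][v] = 1 if paper p appeared in venue v.
W_AP = [[1, 1, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 1]]
W_PV = [[1, 0], [1, 0], [0, 1], [0, 1]]

# Path count: multiply the adjacency matrices along the meta-path.
M_APV = matmul(W_AP, W_PV)
path_count = matmul(M_APV, transpose(M_APV))

# Random walk: multiply the row-normalized matrices along the meta-path,
# so each entry is the sum of the probabilities of all path instances.
random_walk = None
for w in [W_AP, W_PV, transpose(W_PV), transpose(W_AP)]:
    u = row_normalize(w)
    random_walk = u if random_walk is None else matmul(random_walk, u)

# Pairwise random walk: x and y both walk Author-Paper-Venue and the
# score is the probability that they arrive at the same venue.
half_walk = matmul(row_normalize(W_AP), row_normalize(W_PV))
pairwise_random_walk = matmul(half_walk, transpose(half_walk))
```

On this toy data the path count matrix is symmetric while the random walk
matrix is not, which is one reason the choice of measure matters.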
In general, we can define a meta-path based
similarity framework for two objects x and y.
Note that P-PageRank and SimRank, two
well-known network similarity functions, are
weighted combinations of the random walk
measure and the pairwise random walk measure,
respectively, over meta-paths of different
lengths in homogeneous networks. In order to use
P-PageRank and SimRank in heterogeneous
information networks, the random walk has to be
restricted to the meta-paths of interest.
a) A Novel Similarity Measure
Several similarity measures have been
presented, but they are biased toward either
highly visible objects or highly concentrated
objects and cannot capture the semantics of peer
similarity. However, in many scenarios, finding
similar objects in networks means finding
similar peers, such as finding similar authors
based on their fields and reputation, finding
similar actors based on their movie styles and
productivity, and finding similar products.
This motivated us to propose a new meta-path
based similarity measure, called PathSim, that
captures the subtlety of peer similarity. The
insight behind it is that two similar peer
objects should not only be strongly connected,
but also share comparable visibility. As the
peer relation should be symmetric, we confine
PathSim to symmetric meta-paths. The calculation
of PathSim between any two objects of the same
type given a certain meta-path involves matrix
multiplication.
In this paper, we only consider meta-paths in
the round-trip form, to guarantee their symmetry
and therefore the symmetry of the PathSim
measure.
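A minimal sketch of the computation follows. The normalisation used here,
twice the number of path instances between x and y divided by the sum of the
round-trip counts from x to itself and from y to itself, follows the
published definition of PathSim and is not spelled out in this text; the
function name pathsim_matrix and the toy author-by-venue counts are
assumptions for illustration.

```python
def pathsim_matrix(half):
    """PathSim scores for the round-trip meta-path built from `half`.

    half[i][f] is the number of path instances from target object i to
    middle object f along the first half of the meta-path.
    """
    n = len(half)
    # Commuting matrix M = half * half^T counts round-trip path instances.
    m = [[sum(a * b for a, b in zip(half[i], half[j])) for j in range(n)]
         for i in range(n)]
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denom = m[i][i] + m[j][j]
            # Objects with no path instances at all get similarity 0.
            sim[i][j] = 2.0 * m[i][j] / denom if denom else 0.0
    return sim


# Assumed toy counts: rows are authors, columns are venues.
# Author 3 is far more visible (20 papers) than authors 0, 1 and 2.
AUTHOR_VENUE = [[2, 0], [1, 1], [0, 1], [20, 0]]
scores = pathsim_matrix(AUTHOR_VENUE)
```

In this toy example author 0 shares 40 path instances with the prolific
author 3 but only 2 with author 1, yet scores[0][1] is larger than
scores[0][3], because the denominator penalises the mismatch in visibility.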
Properties of PathSim
1. Symmetric.
2. Self-maximum
3. Balance of visibility
Although meta-path based similarity lets us
define the similarity between two objects for
any round-trip meta-path, very long meta-paths
are not very meaningful.
As primary eigenvectors can be used as
authority rankings of objects, the similarity
between two objects under an infinite meta-path
can be viewed as a measure defined on their
rankings. Two objects with more similar
ranking scores will have higher similarity. In
the next section we discuss online query
processing for a single meta-path.
IV. QUERY PROCESSING FOR SINGLE META-
PATH
Compared with P-PageRank and SimRank, the
calculation of PathSim is much more efficient,
as it is a local graph measure. However, it still
involves expensive matrix multiplication
operations for top-k search, as we need to
calculate the
similarity between a query and every object of
the same type in the network. One possible
solution is to materialize all the meta-paths.
In order to support fast online query processing
for large-scale networks, we propose a
methodology that partially materializes short
length meta-paths and then concatenates them
online to derive longer meta-path-based
similarity. First, a baseline method is proposed,
which computes the similarity between query
object x and all the candidate objects y of the
same type. Next, a co-clustering based pruning
method is proposed, which prunes candidate
objects that are not promising according to their
similarity upper bounds. Both algorithms return
exact top-k results for the given query.
a) Baseline
Suppose we know the relation matrix for the
meta-path and its diagonal vector. In order to
get the top-k objects with the highest
similarity to the query, we need to compute the
similarity of every candidate object. The
straightforward baseline is: (1) first apply
vector-matrix multiplication; (2) calculate the
similarity of each object; (3) sort the
similarities and return the top-k list in the
final step. When the matrix is large, the
vector-matrix computation is too time-consuming
to check every possible object. Restricting the
check to candidate objects that share at least
one path instance with the query is much more
efficient than pairwise computation between the
query and all the objects of that type. We call
this baseline concatenation algorithm
PathSim-baseline.
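The three baseline steps can be sketched as follows. This is a minimal
illustration under stated assumptions rather than the authors'
implementation: the names pathsim_baseline, HALF and DIAG are hypothetical,
the matrix is dense, every object of the target type is scored (no candidate
selection), and the similarity in step (2) uses the published PathSim
normalisation.

```python
def pathsim_baseline(half, diag, query, k):
    """Return the top-k (object index, score) pairs for `query`, best first.

    half[i][f]: path instances from target object i to middle object f.
    diag[i]:    precomputed round-trip count from object i back to itself.
    """
    q = half[query]
    # (1) Vector-matrix multiplication: row `query` of half * half^T.
    counts = [sum(a * b for a, b in zip(q, row)) for row in half]
    # (2) Turn each path count into a similarity using the diagonal vector.
    scored = []
    for y, count in enumerate(counts):
        denom = diag[query] + diag[y]
        scored.append((y, 2.0 * count / denom if denom else 0.0))
    # (3) Sort by descending score (ties broken by index) and keep k.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Assumed toy data: rows are authors, columns are venues.
HALF = [[2, 0], [1, 1], [0, 1], [20, 0]]
DIAG = [sum(v * v for v in row) for row in HALF]
top2 = pathsim_baseline(HALF, DIAG, query=0, k=2)
```

Precomputing DIAG once keeps step (2) to a constant amount of work per
candidate, so the cost per query is dominated by the dot products in step (1).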
The PathSim-baseline algorithm is still
time-consuming if the candidate set is large. The
time complexity of computing PathSim for each
candidate is O(d) on average and O(m) in the
worst case, where d is the average degree of an
object and m is the vector length. We now
propose a co-clustering based top-k
concatenation algorithm, by which
non-promising target objects are dynamically
filtered out to reduce the search space.
b) Co-Clustering-Based Pruning
In the baseline algorithm, the computational
cost involves two factors. First, the more
candidates to check, the more time the algorithm
will take; second, for each candidate, the dot
product of the query vector and the candidate
vector will involve at most m operations, where
m is the vector length. Based on this intuition,
we propose a co-clustering-based path
concatenation method, which first generates
co-clusters of two types of objects for the
partial relation matrix, then stores the
necessary statistics for each of the blocks
corresponding to different co-cluster pairs, and
then uses the block statistics to prune the
search space. For clarity, we call clusters of
the target type target clusters, since their
objects are the targets for the query, and call
clusters of the feature type feature clusters,
since their objects serve as features to
calculate the similarity between the query and
the target objects. By partitioning the target
objects into different target clusters, if a
whole target cluster is not similar to
the query, then all the objects in the target
cluster are likely not in the final top-k list
and can be pruned. By partitioning the feature
objects into different feature clusters, cheaper
calculations on the dimension-reduced query
vector and candidate vectors can be used to
derive the similarity upper bounds.
PathSim-pruning can significantly improve the
query processing speed compared with the
baseline algorithm, without affecting the search
quality.
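The pruning idea can be sketched as below. This is a simplified illustration,
not the authors' algorithm: the clusters are assumed to be given (no
co-clustering is run), all counts are assumed to be non-negative, and the
block statistics chosen here (the maximum entry of each block and the minimum
diagonal value of each target cluster) are one possible assumption that
yields a valid upper bound. All names are hypothetical.

```python
def block_statistics(half, diag, target_clusters, feature_clusters):
    """Offline step: maximum entry per block, minimum diagonal per cluster."""
    block_max = [[max(half[y][f] for y in tc for f in fc)
                  for fc in feature_clusters] for tc in target_clusters]
    min_diag = [min(diag[y] for y in tc) for tc in target_clusters]
    return block_max, min_diag


def pathsim_pruned(half, diag, query, k, target_clusters, feature_clusters,
                   block_max, min_diag):
    """Exact top-k search that skips target clusters with low upper bounds.

    Returns (top-k list of (object, score), number of objects scored).
    """
    q = half[query]
    # Dimension-reduced query: one sum per feature cluster.
    q_reduced = [sum(q[f] for f in fc) for fc in feature_clusters]
    bounds = []
    for t in range(len(target_clusters)):
        # Dot product with any object in cluster t is at most this value.
        ub_count = sum(qs * bm for qs, bm in zip(q_reduced, block_max[t]))
        denom = diag[query] + min_diag[t]
        bounds.append(2.0 * ub_count / denom if denom else 0.0)
    # Visit the most promising clusters first.
    order = sorted(range(len(target_clusters)), key=lambda t: -bounds[t])
    top, checked = [], 0
    for t in order:
        if len(top) == k and bounds[t] < top[-1][1]:
            break  # no object in this or any later cluster can enter top-k
        for y in target_clusters[t]:
            count = sum(a * b for a, b in zip(q, half[y]))
            denom = diag[query] + diag[y]
            top.append((y, 2.0 * count / denom if denom else 0.0))
            checked += 1
        top.sort(key=lambda pair: (-pair[1], pair[0]))
        top = top[:k]
    return top, checked


# Assumed toy data: 6 target objects, 4 feature objects, 2 x 2 co-clusters.
HALF = [[2, 0, 0, 0], [1, 1, 0, 0], [3, 0, 0, 0],
        [0, 0, 2, 1], [0, 0, 1, 3], [0, 0, 0, 4]]
DIAG = [sum(v * v for v in row) for row in HALF]
TARGETS = [[0, 1, 2], [3, 4, 5]]
FEATURES = [[0, 1], [2, 3]]
BLOCK_MAX, MIN_DIAG = block_statistics(HALF, DIAG, TARGETS, FEATURES)
result, checked = pathsim_pruned(HALF, DIAG, 0, 2, TARGETS, FEATURES,
                                 BLOCK_MAX, MIN_DIAG)
```

Here the second target cluster has an upper bound of zero for the query, so
only three of the six objects are scored, and the returned list is the same
one an exhaustive scan would produce.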
c) Multiple Meta-Paths Combination
In the previous sections, we presented
algorithms for similarity search using a single
meta-path. Now, we present a solution to combine
multiple meta-paths. The reason why we need to
combine several meta-paths is that each
meta-path provides a unique angle from which to
view the similarity between objects, and the
ground truth may be the result of different
factors. Some useful guidance for the weight
assignment includes: longer meta-paths utilize
more remote relationships and thus should be
assigned a smaller weight, as in P-PageRank and
SimRank; and meta-paths with more important
relationships should be assigned a higher
weight. To determine the weights automatically,
users could provide training examples of similar
objects, from which the weights of different
meta-paths can be learned using a learning
algorithm.
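One simple way to realise such a combination is a weighted sum of the
per-meta-path scores. The sketch below assumes a linear combination and a
hypothetical length-based decay for the weights (decay raised to the
meta-path length, normalised to sum to one); neither the function names nor
the decay value come from the paper, and learned weights could be passed in
instead.

```python
def length_decay_weights(lengths, decay=0.5):
    """Assumed heuristic: longer meta-paths receive smaller weights."""
    raw = [decay ** n for n in lengths]
    total = sum(raw)
    return [r / total for r in raw]


def combine_meta_paths(similarities, weights):
    """Weighted sum of per-meta-path similarity scores for one object pair."""
    if len(similarities) != len(weights):
        raise ValueError("need exactly one weight per meta-path")
    return sum(w * s for w, s in zip(weights, similarities))


# Two assumed meta-paths of length 2 and 4 with made-up scores 0.6 and 0.9.
weights = length_decay_weights([2, 4])
combined = combine_meta_paths([0.6, 0.9], weights)
```

With the default decay, the shorter meta-path receives weight 0.8 and the
longer one 0.2, so the combined score stays closer to the short-path score.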
V. EXPECTED RESULTS
To show the effectiveness of the PathSim
measure and the efficiency of the proposed
algorithms, we use a bibliographic network
extracted from DBLP and an image network
extracted from Flickr in the experiments.
The PathSim-pruning algorithm is expected to
significantly improve the query processing speed
compared with the baseline algorithm, without
affecting the search quality.
For additional case studies, we construct a
Flickr network from a subset of the Flickr data,
which contains four types of objects: images,
users, tags, and groups. We aim to show that our
algorithms improve similarity search between
objects based on the strengths and relationships
between objects.
VI. CONCLUSION
In this paper we introduced a meta-path based
similarity search approach that uses a baseline
algorithm and a co-clustering based pruning
algorithm to improve similarity search based on
the strengths and relationships between objects.
