ANHAI DOAN ALON HALEVY ZACHARY IVES
Chapter 7: Data Matching
PRINCIPLES OF
DATA INTEGRATION
Introduction
 Data matching: find structured data items that refer to
the same real-world entity
 entities may be represented by tuples, XML elements, or RDF
triples, not by strings as in string matching
 e.g., (David Smith, 608-245-4367, Madison WI)
vs (D. M. Smith, 245-4367, Madison WI)
 Data matching arises in many integration scenarios
 merging multiple databases with the same schema
 joining rows from sources with different schemas
 matching a user query to a data item
 One of the most fundamental problems in data integration
Outline
 Problem definition
 Rule-based matching
 Learning-based matching
 Matching by clustering
 Probabilistic approaches to matching
 Collective matching
 Scaling up data matching
Problem Definition
 Given two relational tables X and Y with identical
schemas
 assume each tuple in X and Y describes an entity (e.g., person)
 We say tuple x ∈ X matches tuple y ∈ Y if they refer to the same real-world entity
 (x,y) is called a match
 Goal: find all matches between X and Y
Example
(example tables X and Y, with their matching tuples, appear as a figure on the slide)
 Other variations
 Tables X and Y have different schemas
 Match tuples within a single table X
 The data is not relational, but XML or RDF
 These are not considered in this chapter (see bib notes)
Why is This Different from
String Matching?
 In theory, can treat each tuple as a string by
concatenating the fields, then apply string matching
techniques
 But doing so makes it hard to apply sophisticated
techniques and domain-specific knowledge
 E.g., consider matching tuples that describe persons
 suppose we know that in this domain two tuples match if the
names and phone match exactly
 this knowledge is hard to encode if we use string matching
 so it is better to keep the fields apart
Challenges
 Same as in string matching
 How to match accurately?
 difficult due to variations in formatting conventions, use of
abbreviations, shortening, different naming conventions,
omissions, nicknames, and errors in data
 several common approaches: rule-based, learning-based,
clustering, probabilistic, collective
 How to scale up to large data sets?
 again many approaches have been developed, as we will
discuss
Outline
 Problem definition
 Rule-based matching
 Learning-based matching
 Matching by clustering
 Probabilistic approaches to matching
 Collective matching
 Scaling up data matching
Rule-based Matching
 The developer writes rules that specify when two tuples
match
 typically after examining many matching and non-matching
tuple pairs, using a development set of tuple pairs
 rules are then tested and refined, using the same development
set or a test set
 Many types of rules exist; we will consider:
 linearly weighted combination of individual similarity scores
 logistic regression combination
 more complex rules
Linearly Weighted Combination Rules
 compute an overall similarity score as a linearly weighted sum of individual similarity scores: sim(x,y) = Σ_{i=1..n} α_i · s_i(x,y), where each s_i scores one aspect of the pair and the weights α_i sum to 1
 declare x and y matched if sim(x,y) ≥ β for a pre-specified threshold β
Example
 sim(x,y) =
0.3·s_name(x,y) + 0.3·s_phone(x,y) + 0.1·s_city(x,y) + 0.3·s_state(x,y)
 s_name(x,y): based on Jaro-Winkler
 s_phone(x,y): based on edit distance between x's phone (after removing the area code) and y's phone
 s_city(x,y): based on edit distance
 s_state(x,y): based on exact match; yes → 1, no → 0
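The rule above translates directly into code. Below is a minimal Python sketch; the normalized edit-distance similarity and the area-code-stripping helper are stand-ins for whatever string-matching library is actually used (the slides use Jaro-Winkler for s_name).

```python
def edit_sim(s, t):
    """Levenshtein similarity normalized to [0, 1] (stand-in measure)."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (s[i - 1] != t[j - 1]))
    return 1 - d[m][n] / max(m, n, 1)

def strip_area_code(phone):
    # Hypothetical helper: drop a leading area code such as "608-".
    return phone.split("-", 1)[1] if phone.count("-") == 2 else phone

def sim(x, y):
    # x, y: dicts with 'name', 'phone', 'city', 'state' fields.
    return (0.3 * edit_sim(x["name"], y["name"])        # s_name
            + 0.3 * edit_sim(strip_area_code(x["phone"]), y["phone"])  # s_phone
            + 0.1 * edit_sim(x["city"], y["city"])      # s_city
            + 0.3 * float(x["state"] == y["state"]))    # s_state

x = {"name": "David Smith", "phone": "608-245-4367", "city": "Madison", "state": "WI"}
y = {"name": "D. M. Smith", "phone": "245-4367", "city": "Madison", "state": "WI"}
print(sim(x, y))  # declare a match if this exceeds a chosen threshold β
```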
Pros and Cons
 Pros
 conceptually simple, easy to implement
 can learn the weights α_i from training data
 Cons
 an increase δ in the value of any s_i causes a linear increase α_i · δ in the value of s
 in certain scenarios this is not desirable: after a certain threshold, an increase in s_i should count less (i.e., "diminishing returns" should kick in)
 e.g., if s_name(x,y) is already 0.95 then the two names already match very closely
 so any further increase in s_name(x,y) should contribute only minimally
Logistic Regression Rules
 combine the individual scores with a sigmoid: sim(x,y) = 1 / (1 + e^(−z(x,y))), where z(x,y) = Σ_{i=1..n} α_i · s_i(x,y)
 declare x and y matched if sim(x,y) exceeds a pre-specified threshold; unlike the linear rule, each s_i now naturally exhibits diminishing returns
Logistic Regression Rules
 Are also very useful in situations where
 there are many “signals” (e.g., 10-20) that can contribute to
whether two tuples match
 we don’t need all of these signals to “fire” in order to conclude
that the tuples match
 as long as a reasonable number of them fire, we have sufficient
confidence
 Logistic regression is a natural fit for such cases
 Hence it is quite popular as a first matching method to try
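As a rough illustration, the sketch below applies a logistic-regression-style rule to a vector of precomputed signals; the signal values, weights, and bias are hypothetical, and in practice the weights would be learned from training data.

```python
import math

def logistic_sim(signals, weights, bias=0.0):
    """Sigmoid of the weighted sum of signals: 1 / (1 + e^(-z))."""
    z = bias + sum(w * s for w, s in zip(weights, signals))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical: 4 signals in [0, 1] with learned weights. No single
# signal decides the match, but enough of them "firing" pushes the
# score past the threshold.
score = logistic_sim([0.9, 1.0, 0.0, 0.8], [2.0, 1.5, 1.0, 0.5], bias=-2.5)
print(score, score > 0.5)
```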
More Complex Rules
 Appropriate when we want to encode more complex
matching knowledge
 e.g., two persons match if names match approximately and
either phones match exactly or addresses match exactly
1. If s_name(x,y) < 0.8 then return "not matched"
2. Otherwise if e_phone(x,y) = true then return "matched"
3. Otherwise if e_city(x,y) = true and e_state(x,y) = true then return "matched"
4. Otherwise return "not matched"
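The four steps transcribe directly into code. A minimal sketch, assuming dict-shaped tuples, treating e_phone, e_city, e_state as exact-equality tests, and taking s_name as any name-similarity function in [0, 1] (the slides use Jaro-Winkler):

```python
def match(x, y, s_name):
    # Step 1: names must match approximately.
    if s_name(x["name"], y["name"]) < 0.8:
        return "not matched"
    # Step 2: an exact phone match suffices (e_phone).
    if x["phone"] == y["phone"]:
        return "matched"
    # Step 3: otherwise exact city AND state match (e_city, e_state).
    if x["city"] == y["city"] and x["state"] == y["state"]:
        return "matched"
    # Step 4: everything else.
    return "not matched"
```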
Pros and Cons of
Rule-Based Approaches
 Pros
 easy to start, conceptually relatively easy to understand,
implement, debug
 typically run fast
 can encode complex matching knowledge
 Cons
 can be labor intensive: it takes a lot of time to write good rules
 can be difficult to set appropriate weights
 in certain cases it is not even clear how to write rules
 learning-based approaches address these issues
Outline
 Problem definition
 Rule-based matching
 Learning-based matching
 Matching by clustering
 Probabilistic approaches to matching
 Collective matching
 Scaling up data matching
Learning-based Matching
 Here we consider supervised learning
 learn a matching model M from training data, then apply M to
match new tuple pairs
 will consider unsupervised learning later
 Learning a matching model M (the training phase)
 start with training data: T = {(x_1,y_1,l_1), …, (x_n,y_n,l_n)}, where each (x_i,y_i) is a tuple pair and l_i is a label: "yes" if x_i matches y_i and "no" otherwise
 define a set of features f_1, …, f_m, each quantifying one aspect of the domain judged possibly relevant to matching the tuples
Learning-based Matching
 Learning a matching model M (continued)
 convert each training example (x_i,y_i,l_i) in T to a pair (<f_1(x_i,y_i), …, f_m(x_i,y_i)>, c_i)
 v_i = <f_1(x_i,y_i), …, f_m(x_i,y_i)> is a feature vector that encodes (x_i,y_i) in terms of the features
 c_i is an appropriately transformed version of the label l_i (e.g., yes/no or 1/0, depending on what matching model we want to learn)
 thus T is transformed into T′ = {(v_1,c_1), …, (v_n,c_n)}
 apply a learning algorithm (e.g. decision trees, SVMs) to T’ to
learn a matching model M
Learning-based Matching
 Applying model M to match new tuple pairs
 given a pair (x,y), transform it into a feature vector
v = <f_1(x,y), …, f_m(x,y)>
 apply M to v to predict whether x matches y
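A minimal end-to-end sketch of this train/apply pipeline, assuming scikit-learn is available; featurize() is a hypothetical feature extractor standing in for f_1, …, f_m.

```python
from sklearn.tree import DecisionTreeClassifier

def featurize(x, y):
    # Toy features: exact-match indicators on three fields.
    return [float(x["name"] == y["name"]),
            float(x["phone"] == y["phone"]),
            float(x["state"] == y["state"])]

def train_matcher(training_data):
    """training_data: list of (x_i, y_i, l_i) with l_i in {0, 1}."""
    vectors = [featurize(x, y) for x, y, _ in training_data]  # the v_i
    labels = [l for _, _, l in training_data]                 # the c_i
    model = DecisionTreeClassifier().fit(vectors, labels)     # model M
    return model

def predict(model, x, y):
    # Apply M to the feature vector of a new pair (x, y).
    return bool(model.predict([featurize(x, y)])[0])
```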
Example:
Learning a Linearly Weighted Rule
 s_1 and s_2 use Jaro-Winkler and edit distance
 s_3 uses edit distance (ignoring the area code of a)
 s_4 and s_5 return 1 if exact match, 0 otherwise
 s_6 encodes a heuristic constraint
Example:
Learning a Linearly Weighted Rule
 learn the weights α_i from the training data, e.g., by a least-squares fit that minimizes the squared error between the predicted scores and the 1/0 labels
Example: Learning a Decision Tree
(the learned decision tree appears as a figure on the slide)
Now the labels are
yes/no, not 1/0
The Pros and Cons of
Learning-based Approach
 Pros compared to rule-based approaches
 in rule-based approaches one must manually decide whether a particular feature is useful → labor intensive, and limits the number of features we can consider
 learning-based ones can automatically examine a large number
of features
 learning-based approaches can construct very complex “rules”
 Cons
 still require training examples, in many cases a large number of
them, which can be hard to obtain
 clustering addresses this problem
Outline
 Problem definition
 Rule-based matching
 Learning-based matching
 Matching by clustering
 Probabilistic approaches to matching
 Collective matching
 Scaling up data matching
Matching by Clustering
 Many common clustering techniques have been used
 agglomerative hierarchical clustering (AHC), k-means, graph-theoretic, …
 here we focus on AHC, a simple yet very commonly used one
 AHC
 partitions a given set of tuples D into a set of clusters
 all tuples in a cluster refer to the same real-world entity; tuples in different clusters refer to different entities
 begins by placing each tuple in D in its own cluster
 iteratively merges the two most similar clusters
 stops when a desired number of clusters has been reached, or when the similarity between the two closest clusters falls below a pre-specified threshold
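A minimal sketch of AHC for matching, assuming a pairwise sim() in [0, 1]; cluster similarity here uses single link, following the definition two slides below (other choices are listed there as well).

```python
def cluster_sim(c, d, sim):
    # Single link, as defined on the "Computing a Similarity Score" slide.
    return min(sim(x, y) for x in c for y in d)

def ahc(tuples, sim, threshold):
    clusters = [[t] for t in tuples]          # each tuple in its own cluster
    while len(clusters) > 1:
        # Find the two most similar clusters.
        pairs = [((i, j), cluster_sim(clusters[i], clusters[j], sim))
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        (i, j), best = max(pairs, key=lambda p: p[1])
        if best < threshold:                  # closest clusters too dissimilar
            break
        clusters[i].extend(clusters.pop(j))   # merge the pair
    return clusters                           # one cluster per entity
```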
Example
 sim(x,y) =
0.3·s_name(x,y) + 0.3·s_phone(x,y) + 0.1·s_city(x,y) + 0.3·s_state(x,y)
Computing a Similarity Score
between Two Clusters
 Let c and d be two clusters
 Single link: s(c,d) = min_{x_i∈c, y_j∈d} sim(x_i, y_j)
 Complete link: s(c,d) = max_{x_i∈c, y_j∈d} sim(x_i, y_j)
 Average link: s(c,d) = [Σ_{x_i∈c, y_j∈d} sim(x_i, y_j)] / [# of (x_i, y_j) pairs]
 Canonical tuple
 create a canonical tuple that represents each cluster
 the sim between c and d is the sim between their canonical tuples
 the canonical tuple is created from attribute values of the tuples
 e.g., "Mike Williams" and "M. J. Williams" → "Mike J. Williams"
 (425) 247 4893 and 247 4893 → (425) 247 4893
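Sketches of the three link measures, following the slide's definitions (c and d are lists of tuples; sim scores a pair in [0, 1]):

```python
def single_link(c, d, sim):
    return min(sim(x, y) for x in c for y in d)

def complete_link(c, d, sim):
    return max(sim(x, y) for x in c for y in d)

def average_link(c, d, sim):
    scores = [sim(x, y) for x in c for y in d]
    return sum(scores) / len(scores)
```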
Key Ideas underlying
the Clustering Approach
 View matching tuples as the problem of constructing
entities (i.e., clusters)
 The process is iterative
 leverage what we have learned so far to build "better" entities
 In each iteration merge all matching tuples within a cluster to build an "entity profile", then use it to match other tuples → merging, then exploiting the merged information to help matching
 These same ideas appear in subsequent approaches that
we will cover
Outline
 Problem definition
 Rule-based matching
 Learning- based matching
 Matching by clustering
 Probabilistic approaches to matching
 Collective matching
 Scaling up data matching
Probabilistic Approaches to Matching
 Model matching domain using a probability distribution
 Reason with the distribution to make matching decisions
 Key benefits
 provide a principled framework that can naturally incorporate a
variety of domain knowledge
 can leverage the wealth of probabilistic representation and reasoning techniques already developed in the AI and DB communities
 provide a frame of reference for comparing and explaining
other matching approaches
 Disadvantages
 computationally expensive
 often hard to understand and debug matching decisions
What We Discuss Next
 Most current probabilistic approaches employ generative
models
 these encode full probability distributions and describe how to generate data that fit the distributions
 Some newer approaches employ discriminative models
(e.g., conditional random fields)
 these encode only the probabilities necessary for matching
(e.g., the probability of a label given a tuple pair)
 Here we focus on generative model based approaches
 first we explain Bayesian networks, a simple type of generative model
 then we use them to explain more complex ones
Bayesian Networks: Motivation
 Let X = {x_1, …, x_n} be a set of variables
 e.g., X = {Cloud, Sprinkler}
 A state = an assignment of values to all variables in X
 e.g., s = {Cloud = true, Sprinkler = on}
 A probability distribution P assigns to each state s_i a value P(s_i) such that Σ_{s_i∈S} P(s_i) = 1
 S is the set of all states
 P(s_i) is called the probability of s_i
Bayesian Networks: Motivation
 Reasoning with prob models: to answer queries such as
 P(A = a)? P(A = a|B = b) = ? where A and B are subsets of vars
 Examples
 P(Cloud = t) = 0.6
(by summing over the first two rows of the joint-distribution table on the slide)
 P(Cloud = t | Sprinkler = off) = 0.75
 Problem: we can't enumerate all states; there are too many of them
 real-world apps often use hundreds or thousands of variables
 Bayesian networks solve this by providing a compact
representation of a probability distribution
Bayesian Networks: Representation
 nodes = variables, edges = probabilistic dependencies
 Key assertion: each node is probabilistically independent
of its non-descendants given the values of its parents
 e.g., WetGrass is independent of Cloud given Sprinkler & Rain
 Sprinkler is independent of Rain given Cloud
Bayesian Networks: Representation
 The key assertion allows us to write
 P(C,S,R,W) = P(C)·P(S|C)·P(R|C)·P(W|S,R)
 Thus, to compute P(C,S,R,W), we need to know only four local probability distributions, also called conditional probability tables (CPTs)
 use only 9 numbers to specify the full distribution, instead of 16
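As an illustration, the factorization can be evaluated from 9 stored numbers; the CPT values below are hypothetical (the slide's actual numbers are not reproduced here).

```python
# 1 + 2 + 2 + 4 = 9 numbers specify the full joint distribution.
P_C = 0.5                                          # P(Cloud = t)
P_S = {True: 0.1, False: 0.5}                      # P(Sprinkler = on | C)
P_R = {True: 0.8, False: 0.2}                      # P(Rain = t | C)
P_W = {(True, True): 0.99, (True, False): 0.90,    # P(WetGrass = t | S, R)
       (False, True): 0.90, (False, False): 0.01}

def joint(c, s, r, w):
    """P(C,S,R,W) = P(C) · P(S|C) · P(R|C) · P(W|S,R)."""
    pc = P_C if c else 1 - P_C
    ps = P_S[c] if s else 1 - P_S[c]
    pr = P_R[c] if r else 1 - P_R[c]
    pw = P_W[(s, r)] if w else 1 - P_W[(s, r)]
    return pc * ps * pr * pw

print(joint(True, False, True, True))
```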
Bayesian Networks: Reasoning
 Also called performing inference
 computing P(A) or P(A|B), where A and B are subsets of vars
 Performing exact inference is NP-hard
 taking time exponential in number of variables in worst case
 Data matching approaches address this in three ways
 for certain classes of BNs there are polynomial-time algorithms
or closed-form equations that return exact answers
 use standard approximate inference algorithms for BNs
 develop approximate algorithms tailored to the domain at hand
Learning Bayesian Networks
 To use a BN, current data matching approaches
 typically require a domain expert to create the graph
 then learn the CPTs from training data
 Training data: set of states we have observed
 e.g., d1 = (Cloud=t, Sprinkler=off, Rain=t, WetGrass=t)
d2 = (Cloud=t, Sprinkler=off, Rain=f, WetGrass=f)
d3 = (Cloud=f, Sprinkler=on, Rain=f, WetGrass=t)
 Two cases
 training data has no missing values
 training data has some missing values
 greatly complicates learning; must use the EM algorithm
 we now consider them in turn
Learning with No Missing Values
 d1 = (1,0) means A = 1 and B = 0
Learning with No Missing Values
 Let θ be the probabilities to be learned. We want to find the θ* that maximizes the probability of observing the training data D
 θ* = arg max_θ P(D|θ)
 θ* can be obtained by simple counting over D
 E.g., to compute P(A = 1): count # of examples where A =
1, divide by total # of examples
 To compute P(B = 1 | A = 1): divide # of examples where
B = 1 and A =1 by # of examples where A = 1
 What if we do not have sufficient data for certain states?
 e.g., we need to compute P(B=1|A=1), but the # of examples where A = 1 is 0
 need smoothing of the probabilities (see notes)
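A sketch of the counting estimates above for two binary variables, with hypothetical training data; the last line shows a simple Laplace-smoothed variant for the sparse-data case.

```python
D = [(1, 0), (1, 1), (0, 1), (1, 0)]      # hypothetical (A, B) examples

n_A1 = sum(a == 1 for a, _ in D)
p_A1 = n_A1 / len(D)                      # P(A = 1)

n_B1_A1 = sum(a == 1 and b == 1 for a, b in D)
p_B1_given_A1 = n_B1_A1 / n_A1            # P(B = 1 | A = 1)

# Laplace smoothing: one pseudo-count per value of B, so the estimate
# stays defined even when no examples have A = 1.
p_B1_given_A1_smoothed = (n_B1_A1 + 1) / (n_A1 + 2)
```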
Learning with Missing Values
 Training examples may have missing values
 d = (Cloud=?, Sprinkler=off, Rain=?, WetGrass=t)
 Why?
 we failed to observe a variable
 e.g., slept and did not observe whether it rained
 the variable is by its nature unobservable
 e.g., werewolves who only come out on dark, moonless nights can never tell whether the sky is cloudy
 Can’t use counting as before to learn (e.g., infer CPTs)
 Use EM algorithm
The Expectation-Maximization (EM)
Algorithm
 Key idea:
 there are two unknown quantities: θ and the missing values in D
 iteratively estimate both: assign initial values, then use one to predict the other and vice versa, until convergence
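A compact EM sketch for the simplest possible setup: one observed binary variable S per example and one hidden binary variable M, with θ = (P(M=1), P(S=1|M)). The data and initial values are hypothetical.

```python
data = [1, 1, 0, 1, 0, 0, 1]       # observed S; the M values are missing

p_m = 0.5                          # initial guesses for theta
p_s = {1: 0.7, 0: 0.3}             # P(S = 1 | M = m)

for _ in range(50):                # iterate until (near) convergence
    # E-step: expected value of each missing M, given current theta.
    posts = []
    for s in data:
        like1 = p_m * (p_s[1] if s else 1 - p_s[1])
        like0 = (1 - p_m) * (p_s[0] if s else 1 - p_s[0])
        posts.append(like1 / (like1 + like0))
    # M-step: re-estimate theta by fractional counting.
    p_m = sum(posts) / len(posts)
    p_s[1] = sum(p * s for p, s in zip(posts, data)) / sum(posts)
    p_s[0] = sum((1 - p) * s for p, s in zip(posts, data)) / sum(1 - p for p in posts)

print(p_m, p_s)
```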
An Example
 EM also aims to find the θ that maximizes P(D|θ)
 just like the counting approach in the case of no missing values
 It may not find the globally maximal θ*
 converging instead to a local maximum
Bayesian Networks
as Generative Models
 Generative models
 encode full probability distributions
 specify how to generate data that fit such distributions
 Bayesian networks: well-known examples of such models
 A perspective on how the data is generated helps
 guide the construction of the Bayesian network
 discover what kinds of domain knowledge can be naturally incorporated into the network structure
 explain the network to users
 We now examine three probabilistic approaches to matching that employ increasingly complex generative models
Data Matching with Naïve Bayes
 Given a tuple pair (a,b), define a variable M that represents whether a and b match
 Our goal is to compute P(M|a,b)
 declare a and b matched if P(M=t|a,b) > P(M=f|a,b)
 Assume P(M|a,b) depends only on S_1, …, S_n, features that are functions taking a and b as input
 e.g., whether the two last names match, the edit distance between soc sec numbers, whether the first initials match, etc.
 P(M|a,b) = P(M|S_1, …, S_n); using Bayes' Rule, we have
 P(M|S_1, …, S_n) = P(S_1, …, S_n|M) · P(M) / P(S_1, …, S_n)
Data Matching with Naïve Bayes
 applying the Naïve Bayes assumption that S_1, …, S_n are independent of one another given M, the likelihood factorizes: P(S_1, …, S_n|M) = Π_i P(S_i|M)
 hence P(M|S_1, …, S_n) ∝ P(M) · Π_i P(S_i|M), which we can compute once we know the CPTs P(M) and P(S_i|M)
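A minimal sketch of Naïve Bayes match scoring, with two binary features and hypothetical CPT values:

```python
p_m = 0.3                                  # P(M = matched)
cpts = [{1: 0.9, 0: 0.2},                  # P(S_1 = 1 | M): last names match
        {1: 0.8, 0: 0.1}]                  # P(S_2 = 1 | M): first initials match

def p_match(signals):
    """P(M = 1 | S_1, ..., S_n) via Bayes' rule + the independence assumption."""
    like = {1: p_m, 0: 1 - p_m}
    for m in (0, 1):
        for cpt, s in zip(cpts, signals):
            like[m] *= cpt[m] if s else 1 - cpt[m]
    return like[1] / (like[0] + like[1])   # normalizing cancels P(S_1..S_n)

print(p_match([1, 1]) > 0.5)               # declare matched?
```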
The Naïve Bayes Model
 The assumption that S_1, …, S_n are independent of one another given M is called the Naïve Bayes assumption
 which often does not hold in practice
 Computing P(M|S_1, …, S_n) is performing an inference on the above Bayesian network
 Given the simple form of the network, this inference can
be performed easily, if we know the CPTs
Learning the CPTs Given Training Data
 given labeled tuple pairs (e.g., (a_1,b_1), …, (a_3,b_3) with their correct labels), convert each into a fully observed state (S_1, …, S_n, M) and learn P(M) and the P(S_i|M) by counting, as before
Learning the CPTs Given
No Training Data
 Assume (a_4,b_4), …, (a_6,b_6) are tuple pairs to be matched
 Convert these pairs into training data with missing values
 the missing value is the correct label for each pair (i.e., the
value for variable M: “matched”, “not matched”)
 Now apply EM algorithm to learn both the CPTs and the
missing values at the same time
 once learned, the missing values are the labels (i.e., “matched”,
“not matched”) that we want to see
Summary
 The developer specifies the network structure, i.e., the
directed acyclic graph
 which is a Naïve Bayesian network structure in this case
 If given training data in form of tuple pairs together with
their correct labels (matched, not matched), we can learn
the CPTs of the Naïve Bayes network using counting
 then we use the trained network to match new tuple pairs
(which means performing exact inferences to compute
P(M|a,b))
 People also refer to the Naïve Bayesian network as a
Naïve Bayesian classifier
Summary (cont.)
 If no training data is given, but we are given a set of tuple
pairs to be matched, then we can use these tuple pairs to
construct training data with missing values
 we then apply EM to learn the missing values and the CPTs
 the missing values are the match predictions that we want
 The above procedures (for both cases of having and not
having training data) can be generalized in a
straightforward fashion to arbitrary Bayesian network
cases, not just Naïve Bayesian ones
Modeling Feature Correlations
 Naïve Bayes assumes no correlations among S_1, …, S_n
 We may want to model such correlations
 e.g., if S_1 models whether soc sec numbers match, and S_3 models whether last names match, then there exists a correlation between the two
 We can then train and apply this more
expressive BN to match data
 Problem: this "blows up" the number of probabilities in the CPTs
 assume n is the # of features, q is the # of parents per node, and d is the # of values per node → O(n·d^q) probabilities, vs. 2·d·n for the comparable Naïve Bayes network
Modeling Feature Correlations
 A possible solution
 assume each tuple has k attributes
 consider only k features S_1, …, S_k, where the i-th feature compares only the values of the i-th attributes
 introduce binary variables X_i; X_i models whether the i-th attributes should match, given that the tuples match
 then model correlations only at the X_i level, not at the S_i level
 This requires far fewer probabilities in the CPTs
 assume each X_i has q parents, and each S_i has d values; then we need only O(k·2^q + 2·k·d) probabilities
Key Lesson
 Constructing a BN for a matching problem is an art that
must consider the trade-offs among many factors
 how much domain knowledge to capture
 how accurately we can learn the network
 how efficiently we can do so
 The notes present an even more complex example about
matching mentions of entities in text
Outline
 Problem definition
 Rule-based matching
 Learning-based matching
 Matching by clustering
 Probabilistic approaches to matching
 Collective matching
 Scaling up data matching
Collective Matching
 Matching approaches discussed so far make independent
matching decisions
 decide whether a and b match independently of whether any
two other tuples c and d match
 Matching decisions, however, are often correlated
 exploiting such correlations can improve matching accuracy
An Example
(four papers and the author table extracted from them appear as a figure on the slide)
 Goal: match authors of the four papers listed above
 Solution
 extract their names to create the table above
 apply current approaches to match tuples in table
 This fails to exploit co-author relationships in the data
An Example (cont.)
 nodes = authors, hyperedges connect co-authors
 Suppose we have matched a_3 and a_5
 then intuitively a_1 and a_4 should be more likely to match
 they share the same name and the same co-author relationship to the same author
An Example (cont.)
 First solution:
 add coAuthors attribute to the tuples
 e.g., tuple a_1 has coAuthors = {C. Chen, A. Ansari}
 tuple a_4 has coAuthors = {A. Ansari}
 apply current methods, using say a Jaccard measure for coAuthors
An Example (cont.)
 Problem:
 suppose a_3: A. Ansari and a_5: A. Ansari share the same name but do not match
 we would match them, and incorrectly boost the score of a_1 and a_4
 How to fix this?
 we want to match a_3 and a_5, then use that info to help match a_1 and a_4; we also want to do the opposite
 so we should match tuples collectively, all at once and iteratively
Collective Matching using Clustering
 Many collective matching approaches exist
 clustering-based, probabilistic, etc.
 Here we consider clustering-based (see notes for more)
 Assume input is graph
 nodes = tuples to be matched
 edges = relationships among tuples
Collective Matching using Clustering
 To match, perform agglomerative hierarchical clustering
 but modify sim measure to consider correlations among tuples
 Let A and B be two clusters of nodes; define
 sim(A,B) = α · sim_attributes(A,B) + (1 − α) · sim_neighbors(A,B)
 α is a pre-defined weight
 sim_attributes(A,B) uses only the attributes of A and B; examples of such scores are single link, complete link, average link, etc.
 sim_neighbors(A,B) considers the correlations
 we discuss it next
An Example of sim_neighbors(A,B)
 Assume a single relationship R on the graph edges
 can generalize to the case of multiple relationships
 Let N(A) be the bag of the cluster IDs of all nodes that
are in relationship R with some node in A
 e.g., cluster A has two nodes a and a’,
a is in relationship R with node b with cluster ID 3, and
a’ is in relationship R with node b’ with cluster ID 3
and another node b’’ with cluster ID 5
 N(A) = {3, 3, 5}
 Define sim_neighbors(A,B) =
Jaccard(N(A),N(B)) = |N(A) ∩ N(B)| / |N(A) ∪ N(B)|
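A sketch of this bag (multiset) Jaccard, using Python's collections.Counter:

```python
from collections import Counter

def sim_neighbors(n_a, n_b):
    """n_a, n_b: bags of neighbor cluster IDs, e.g., [3, 3, 5]."""
    a, b = Counter(n_a), Counter(n_b)
    inter = sum((a & b).values())          # multiset intersection size
    union = sum((a | b).values())          # multiset union size
    return inter / union if union else 0.0

print(sim_neighbors([3, 3, 5], [3, 5, 5]))  # -> 0.5
```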
An Example of sim_neighbors(A,B)
 Recall that earlier we also defined a Jaccard measure as
 JaccardSim_coAuthors(a,b) =
|coAuthors(a) ∩ coAuthors(b)| / |coAuthors(a) ∪ coAuthors(b)|
 Contrast that with
 sim_neighbors(A,B) =
Jaccard(N(A),N(B)) = |N(A) ∩ N(B)| / |N(A) ∪ N(B)|
 In the former, we assume two co-authors match if their "strings" match
 In the latter, two co-authors match only if they have the same cluster ID
An Example to Illustrate the Working of
Agglomerative Hierarchical Clustering
(the worked example appears as a figure on the slide)
Outline
 Problem definition
 Rule-based matching
 Learning- based matching
 Matching by clustering
 Probabilistic approaches to matching
 Collective matching
 Scaling up data matching
Scaling up Rule-based Matching
 Two goals: minimize # of tuple pairs to be matched and
minimize time it takes to match each pair
 For the first goal:
 hashing
 sorting
 indexing
 canopies
 using representatives
 combining the techniques
 Hashing
 hash tuples into buckets, match only tuples within each bucket
 e.g., hash house listings by zipcode, then match within each zip
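A minimal sketch of hashing-based blocking, assuming a pairwise match() predicate and a blocking key such as the zip code:

```python
from collections import defaultdict
from itertools import combinations

def hash_and_match(tuples, key, match):
    buckets = defaultdict(list)
    for t in tuples:
        buckets[key(t)].append(t)              # e.g., key = zip code
    return [(x, y)
            for bucket in buckets.values()     # match within buckets only
            for x, y in combinations(bucket, 2)
            if match(x, y)]

# e.g., hash_and_match(listings, key=lambda t: t["zipcode"], match=my_rule)
```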
Scaling up Rule-based Matching
 Sorting
 use a key to sort tuples, then scan the sorted list and match
each tuple with only the previous (w-1) tuples, where w is a
pre-specified window size
 key should be strongly “discriminative”: brings together tuples
that are likely to match, and pushes apart tuples that are not
 example keys: soc sec, student ID, last name, soundex value of last name
 employs a stronger heuristic than hashing: also requires that
tuples likely to match be within a window of size w
 but is often faster than hashing because it would match fewer pairs
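A sketch of the sorted-neighborhood scan, assuming a key function and a pairwise match() predicate:

```python
def sorted_neighborhood(tuples, key, match, w):
    ordered = sorted(tuples, key=key)        # e.g., key = soundex of last name
    matches = []
    for i, x in enumerate(ordered):
        # Compare x only with the previous w-1 tuples in sorted order.
        for y in ordered[max(0, i - w + 1):i]:
            if match(x, y):
                matches.append((y, x))
    return matches
```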
Scaling up Rule-based Matching
 Indexing
 index tuples such that given any tuple a, can use the index to
quickly locate a relatively small set of tuples that are likely to
match a
 e.g., inverted index on names
 Canopies
 use a computationally cheap sim measure to quickly group tuples into overlapping clusters called canopies (or umbrella sets)
 use a different (far more expensive) sim measure to match
tuples within each canopy
 e.g., use TF/IDF to create canopies
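A sketch of canopy formation, with assumed "loose" and "tight" similarity thresholds: tuples within the loose threshold of a center join its canopy, and tuples within the tight threshold are removed from the pool, so canopies can overlap.

```python
def make_canopies(tuples, cheap_sim, loose=0.5, tight=0.8):
    pool = list(tuples)
    canopies = []
    while pool:
        center = pool.pop(0)                  # pick a canopy center
        canopy, remaining = [center], []
        for t in pool:
            s = cheap_sim(center, t)          # e.g., a cheap TF/IDF similarity
            if s >= loose:
                canopy.append(t)              # joins this canopy
            if s < tight:
                remaining.append(t)           # stays available for other canopies
        pool = remaining
        canopies.append(canopy)
    return canopies                           # match expensively within each
```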
Scaling up Rule-based Matching
 Using representatives
 applied during the matching process
 assigns tuples that have been matched into groups such that
those within a group match and those across groups do not
 create a representative for each group by selecting a tuple in
the group or by merging tuples in the group
 when considering a new tuple, only match it with the
representatives
 Combining the techniques
 e.g., hash houses into buckets using zip codes, then sort houses
within each bucket using street names, then match them using
a sliding window
Scaling up Rule-based Matching
 For the second goal of minimizing time it takes to match
each pair
 no well-established technique as yet
 the solution must be tailored to the application and the matching approach
 e.g., if using a simple rule-based approach that matches
individual attributes then combines their scores using weights
 can use short circuiting: stop the computation of the sim score if it is
already so high that the tuple pair will match even if the remaining
attributes do not match
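A sketch of short circuiting for a linearly weighted rule (weights assumed to sum to 1, each s_i in [0, 1]); it also stops early in the symmetric case where the threshold has become unreachable.

```python
def matches_short_circuit(x, y, sims, weights, threshold):
    score, remaining = 0.0, 1.0
    for s, w in zip(sims, weights):
        score += w * s(x, y)
        remaining -= w                         # weight still unaccounted for
        if score >= threshold:                 # already guaranteed a match
            return True
        if score + remaining < threshold:      # threshold now unreachable
            return False
    return score >= threshold
```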
Scaling up Other Matching Methods
 Learning, clustering, probabilistic, and collective
approaches often face similar scalability challenges,
and can benefit from the same solutions
 Probabilistic approaches raise additional challenges
 if the model has too many parameters → difficult to learn efficiently, and a large amount of training data is needed to learn accurately
 make independence assumptions to reduce the # of parameters
 Once learned, inference with the model is also time costly
 use approximate inference algorithms
 simplify the model so that closed-form equations exist
 The EM algorithm can be expensive
 truncate EM, or initialize it as accurately as possible
Scaling up Using Parallel Processing
 Commonly done in practice
 Examples
 hash tuples into buckets, then match each bucket in parallel
 match tuples against a taxonomy of entities (e.g., a product or
Wikipedia-like concept taxonomy) in parallel
 two tuples are declared matched if they map into the same taxonomy node
 a variant of using representatives to scale up, discussed earlier
Summary
 Critical problem in data integration
 Huge amount of work in academia and industry
 Rule-based matching
 Learning-based matching
 Matching by clustering
 Probabilistic approaches to matching
 Collective matching
 This chapter has covered only the most common and
basic approaches
 The bibliography discusses much more