THE MECHANICS OF
PROBABILISTIC RECORD
MATCHING
Jeffrey Tyzzer
Why Does this Deck Exist?
!  I struggled while studying probabilistic matching--
reading, e.g., the works of Fellegi and Sunter,
Newcombe, Schumacher, and Herzog, et al.--and
wanted to summarize my findings as much to help
others understand it as to check my own
understanding. To that end, please direct any errors
and constructive feedback to me at
jefftyzzer@sbcglobal.net
2
Agenda
!  Recall that Master Data Management (MDM)
enables the consolidation and syndication of
trusted, authoritative data
!  In this presentation, we focus on the consolidation--
or unification--of master data, which is the heart of
all MDM systems
3
Matches
!  In a data set, constructs (i.e. records) are proxies for
real-world objects
!  Matches are entity instances (records) that have the
same values for those properties (attributes) that
serve to identify them
!  One of the goals of Master Data Management is to
ensure that there is a 1:1 correspondence between
the real and proxy objects
4
Ways of Matching
!  There are two principal ways to match: deterministically and
probabilistically
!  Deterministic matching is rules-based, e.g. IF R1.a1 = R2.a1 AND
R1.a2 = R2.a2 THEN Link ELSE NonLink (see the sketch at the end of
this slide)
!  Deterministic matching is binary--all or nothing
!  Probabilistic matching is likelihood-based
!  Probabilistic matching is analog--it’s based on a range of
agreement
!  The pioneers of probabilistic matching were Newcombe, et
al., Tepping, and Fellegi & Sunter.
!  Probabilistic matching is particularly useful in the absence of
unique identifiers, when only so-called quasi-identifiers are
available, such as names and dates of birth
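A minimal sketch of the deterministic rule shown above; the record layout and the attribute names a1 and a2 are assumptions made for illustration, and the sample values are borrowed from the next slide.

def deterministic_link(r1, r2):
    """IF R1.a1 = R2.a1 AND R1.a2 = R2.a2 THEN Link ELSE NonLink."""
    if r1["a1"] == r2["a1"] and r1["a2"] == r2["a2"]:
        return "Link"
    return "NonLink"

# A single near-miss attribute is enough to reject the pair outright:
print(deterministic_link({"a1": "Tyzzer", "a2": "848 Swanston Dr."},
                         {"a1": "Tyzzer", "a2": "884 Swanson Dr."}))   # NonLink

This all-or-nothing behavior is exactly what the probabilistic approach on the following slides relaxes.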
5
Consider
!  R1 Name: Jeff Tyzzer Address: 848 Swanston Dr.
Phone: (916) 555-1212
!  R2 Name: Jeffrey Tyzzer Address: 884 Swanson Dr.
Phone: 555-1212
!  Would you consider these two records to be
matches? Why? Would they be deterministic or
probabilistic matches?
6
Hypothesis Testing
!  In classic probabilistic matching, we take our cue
from inferential statistics when comparing two
records probabilistically:
!  H0 - The null hypothesis: The records do not represent
the same real-world object, i.e. they are not matches
!  HA - The alternate hypothesis: The records represent the
same real-world object, i.e. they are matches
!  Typically, H0 is rejected if the p-value associated with our test
statistic is less than .05
7
Hypothesis Testing, cont’d
!  A Type I error, designated with the Greek letter α
(alpha), occurs when we incorrectly reject H0
!  A Type II error, designated with the Greek letter β
(beta), occurs when we incorrectly fail to reject H0
8
Record Linkage and Type I & II Errors
!  Since we’ve decided that H0 indicates that the
records are different, if we commit a Type I error
(incorrectly rejecting H0) we’re (wrongly) asserting
that the records match. This is a false positive
!  Since we’ve decided that HA indicates that the
records are the same (matches), if we commit a
Type II error (incorrectly failing to reject H0) we’re
(wrongly) asserting that the records do not match.
This is a false negative
9
Agreement Probabilities
!  We must first decide on our match attributes, a domain-
specific decision. For this presentation, we will use First
Name, Last Name, and DoB
!  For our purposes, when comparing these attributes
between records there are two possible outcomes: they
will agree or they won’t
!  We calculate the probabilities of these attributes
agreeing under each of the preceding hypotheses.
There are several methods for computing these; among
them are sampling, prior studies, and Maximum
Likelihood Estimation (MLE) using Expectation
Maximization (EM) -- see the sketch below
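As a preview of the MLE/EM option in the last bullet, here is a minimal sketch (Python with NumPy) of how EM can estimate per-attribute agreement probabilities from nothing but the observed counts of each agreement pattern. The counts, the starting values, and the assumption of exactly three binary comparison fields are all invented for illustration; they are not taken from this deck's data.

import numpy as np

# All 2^3 agreement patterns over (LN, FN, DoB) and hypothetical counts of
# how many compared record pairs exhibited each pattern (invented numbers).
patterns = np.array([[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)])
counts = np.array([60000, 20000, 11000, 3500, 3200, 1100, 600, 1900], dtype=float)

# Starting guesses: p is the unknown share of pairs that are true matches,
# m and u are the per-attribute agreement probabilities under HA and H0.
p = 0.1
m = np.array([0.8, 0.8, 0.8])
u = np.array([0.2, 0.2, 0.2])

for _ in range(200):
    # E-step: for each pattern, P(pair is a match | pattern) under the current
    # p, m, u, assuming conditional independence of the attributes.
    pm = np.prod(m ** patterns * (1 - m) ** (1 - patterns), axis=1)
    pu = np.prod(u ** patterns * (1 - u) ** (1 - patterns), axis=1)
    w = p * pm / (p * pm + (1 - p) * pu)

    # M-step: re-estimate p, m, and u from the expected match/non-match masses.
    match_mass = counts * w
    nonmatch_mass = counts * (1 - w)
    p = match_mass.sum() / counts.sum()
    m = (match_mass[:, None] * patterns).sum(axis=0) / match_mass.sum()
    u = (nonmatch_mass[:, None] * patterns).sum(axis=0) / nonmatch_mass.sum()

print("p =", round(p, 4), "m =", m.round(4), "u =", u.round(4))

The fitted m and u values are what feed the per-pattern calculations on the following slides.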
10
Example
Attribute Non-match (H0) Match (HA)
Last Name .05 .95
First Name .15 .90
DoB .25 .85
!  Using one of the techniques mentioned in slide 10’s
last bullet point, say we find that, for our data, when
the two records do in fact represent the same entity
the last names match 95% of the time, the first names
90%, and the DoBs 85%. When the two records are
known to represent different entities, the match rates
are much lower--5%, 15%, and 25%, respectively
11
Match Attribute Possibilities
!  Since for simplicity’s sake we’re saying that the
attributes must simply either match or not--designating
1 for a match and 0 for a non-match--then for our three
attributes we have the following 2^3 = 8 agreement
possibilities:
LN FN DoB
0 0 0
1 0 0
0 1 0
0 0 1
1 1 0
1 0 1
0 1 1
1 1 1
12
Match Attribute Probabilities
!  The space of all possible agreement patterns is referred to by the
Greek letter Γ (gamma)
!  Given the agreement probabilities listed on slide 11, we next
compute two probabilities for each of the eight agreement patterns
(slide 11) in Γ (in the same attribute order): the m (match)
probability and the u (non-match) probability
!  Example - the m probability for the (0,0,0) pattern (i.e. none match):
(1 - .95) * (1 - .90) * (1 - .85) = 0.00075
!  Example - the u probability for the (1,0,1) pattern (match on LN and
DoB):
(.05) * (1-.15) * (.25) = 0.01063
!  The agreement pattern is viewed as a discrete random variable
representing the set of all possible comparison outcomes (see the
sketch below, which computes m and u for all eight patterns)
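A minimal sketch of this calculation; the tuple layouts and the helper name pattern_prob are illustrative choices, with the per-attribute probabilities taken from slide 11 in the order LN, FN, DoB.

from itertools import product

m_attr = (0.95, 0.90, 0.85)   # P(attribute agrees | records truly match)
u_attr = (0.05, 0.15, 0.25)   # P(attribute agrees | records truly differ)

def pattern_prob(pattern, probs):
    # Probability of an agreement pattern, assuming conditional independence.
    prob = 1.0
    for agrees, p in zip(pattern, probs):
        prob *= p if agrees else (1 - p)
    return prob

for pattern in product((0, 1), repeat=3):      # all 2^3 agreement patterns
    m = pattern_prob(pattern, m_attr)
    u = pattern_prob(pattern, u_attr)
    print(pattern, round(m, 5), round(u, 5))   # e.g. (1, 0, 1) -> 0.08075, 0.01063

The printed values reproduce the table on the next slide.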
13
Match Attribute Probabilities, cont’d
!  The completed table looks like this:
Agreement Pattern m u
0,0,0 .00075 .60563
1,0,0 .01425 .03188
0,1,0 .00675 .10688
0,0,1 .00425 .20188
1,1,0 .12825 .00563
1,0,1 .08075 .01063
0,1,1 .03825 .03563
1,1,1 .72675 .00188
14
Observations
!  Given the agreement probabilities on slide 11, only 72.675% of the
truly matching record pairs agree on all three attributes (and so
would be caught by a strict deterministic rule), and only 60.563% of
the truly non-matching pairs disagree on all three
!  Both columns (must) sum to 1
!  Probabilistic matching gives us maybe in addition to yes and
no as a possible outcome--it lets us deal with those situations
where not all attributes match, but some do (recall your
answers to the questions on slide 6)
!  This technique assumes conditional independence among the
match attributes, which may not always be the case
(consider the correlation between name and gender)
15
Almost There
!  The next two steps are:
!  Calculate the log-likelihood ratio test statistic T, the
base-2 logarithm of the ratio of m and u
e.g., T = log2(0.03825/0.03563) = 0.10237
and order the results ascending by T
!  Compute the cumulative sums of those probabilities (m from the top
down, u from the bottom up) -- see the sketch below
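A minimal sketch of both steps, recomputing m and u from slide 11's per-attribute probabilities (order LN, FN, DoB); the variable names are illustrative.

from itertools import product
from math import log2

m_attr = (0.95, 0.90, 0.85)
u_attr = (0.05, 0.15, 0.25)

def pattern_prob(pattern, probs):
    prob = 1.0
    for agrees, p in zip(pattern, probs):
        prob *= p if agrees else (1 - p)
    return prob

rows = []
for pattern in product((0, 1), repeat=3):
    m = pattern_prob(pattern, m_attr)
    u = pattern_prob(pattern, u_attr)
    rows.append((pattern, m, u, log2(m / u)))   # T = log2(m/u)

rows.sort(key=lambda row: row[3])               # order ascending by T

cum_m = 0.0                                     # Σm, accumulated top-down
cum_u = sum(row[2] for row in rows)             # Σu, accumulated bottom-up
for pattern, m, u, t in rows:
    cum_m += m
    print(pattern, round(t, 5), round(cum_m, 5), round(cum_u, 5))
    cum_u -= u

Its output matches the table on the next slide.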
16
The Test Statistic & Cumulative Prob’s
Agreement Pattern m u T Σm (β) Σu (α)
0,0,0 0.00075 0.60563 -9.65733 0.00075 1.00000
0,0,1 0.00425 0.20188 -5.56989 0.00500 0.39441
0,1,0 0.00675 0.10688 -3.98496 0.01175 0.19253
1,0,0 0.01425 0.03188 -1.16169 0.02600 0.08565
0,1,1 0.03825 0.03563 0.10237 0.06425 0.05377
1,0,1 0.08075 0.01063 2.92532 0.14500 0.01814
1,1,0 0.12825 0.00563 4.50968 0.27325 0.00751
1,1,1 0.72675 0.00188 8.59458 1.00000 0.00188
17
Deciding on the Thresholds
!  We have three choices when confronted with a pair of records: definitely
link them, definitely do not link them, and maybe link them. How do we
decide? By establishing thresholds for each of the three possibilities,
resulting in three discrete (and disjoint) T regions (slide 17)
!  If, as we said on slide 7, we reject H0 when the p-value is less than .05,
then we’ve decided that we’re willing to accept an alpha of .05, meaning
that we’re OK with a Type I error (a false positive, given our definitions of
H0 and HA) 5% of the time. In other words, we’re willing to accept that up
to 5% of our linked records could be linked erroneously
!  Assume that beta, our tolerance for a Type II error (a false negative, given
our definitions of H0 and HA) is also .05. (Note that the false positive and
negative thresholds are domain-specific--what’s the possible harm of a
false positive in a hospital setting versus one for, say, a direct marketer
compiling a household address list?)
18
Deciding on the Thresholds, cont’d
!  The cumulative sum of the m probabilities (from the lowest T upward)
is the chance that a true match falls in the non-link region--our false
negative rate (beta)--and the cumulative sum of the u probabilities
(from the highest T downward) is the chance that a true non-match falls
in the link region--our false positive rate (alpha). The last two
columns in the table on slide 17 show these, respectively
!  Our settings of alpha and beta therefore dictate that any pair of
records with a T of -1.16169 (λ) or less is a definite
non-link and that any pair of records with a T of
2.92532 (μ) or greater is a definite link. Thus,
those with an agreement pattern of (0,1,1) are
our maybes. This is known as the clerical review
region (see the sketch below)
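A minimal sketch of the threshold choice, hard-coding the sorted rows from the table on slide 17 and the alpha = beta = 0.05 tolerances; the variable names are illustrative.

rows = [  # (pattern, m, u, T), sorted ascending by T
    ((0, 0, 0), 0.00075, 0.60563, -9.65733),
    ((0, 0, 1), 0.00425, 0.20188, -5.56989),
    ((0, 1, 0), 0.00675, 0.10688, -3.98496),
    ((1, 0, 0), 0.01425, 0.03188, -1.16169),
    ((0, 1, 1), 0.03825, 0.03563,  0.10237),
    ((1, 0, 1), 0.08075, 0.01063,  2.92532),
    ((1, 1, 0), 0.12825, 0.00563,  4.50968),
    ((1, 1, 1), 0.72675, 0.00188,  8.59458),
]
alpha = beta = 0.05

# Lower (non-link) cutoff: grow the non-link region upward from the lowest T
# while the accumulated m -- the false negative rate -- stays within beta.
cum_m, lower = 0.0, None
for pattern, m, u, t in rows:
    if cum_m + m > beta:
        break
    cum_m, lower = cum_m + m, t

# Upper (link) cutoff: grow the link region downward from the highest T
# while the accumulated u -- the false positive rate -- stays within alpha.
cum_u, upper = 0.0, None
for pattern, m, u, t in reversed(rows):
    if cum_u + u > alpha:
        break
    cum_u, upper = cum_u + u, t

print("non-link if T <=", lower, "| link if T >=", upper)   # -1.16169, 2.92532

Everything strictly between the two cutoffs lands in the clerical review region.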
19
A Graphical Representation
[Chart not reproduced: the y-axis runs from 0 to 1.2, the x-axis shows the
eight T values from -9.65733 to 8.59458, and vertical lines mark λ (the
lower, non-link threshold) and μ (the upper, link threshold)]
20
Interpretation
!  Record pairs to the left of the red line (lambda) are a
certain no and those to the right of the green line (mu)
are a certain yes. In-between the two lines is the
“maybe” region, whose record pairs require human
review
!  Fellegi & Sunter’s technique assures us that the “maybe”
region is as small as possible given our settings for
alpha and beta (ref. the Neyman–Pearson lemma)
!  The width of the clerical region is a function of the
values of α and β (slide 8)
21
Example I
Record LN FN DoB
1 Tyzzer John 5/26/19xx
2 Tyzzer Jeff 5/26/19xx
!  The agreement pattern is (1,0,1). Given its
corresponding T value, these records would be
classified as a match
22
Example II
Record LN FN DoB
1 Smith Jeff 5/26/19xx
2 Tyzzer Jeff 5/26/19xx
!  The agreement pattern is (0,1,1). Given its
corresponding T value, these records would be
classified as a maybe and queued for clerical
review (see the sketch below, which covers both examples)
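A minimal sketch that ties examples I and II together; the field names, record layout, and function name are assumptions, while the T values and cutoffs come from slides 17 and 19.

T = {(0, 0, 0): -9.65733, (0, 0, 1): -5.56989, (0, 1, 0): -3.98496, (1, 0, 0): -1.16169,
     (0, 1, 1):  0.10237, (1, 0, 1):  2.92532, (1, 1, 0):  4.50968, (1, 1, 1):  8.59458}
LOWER, UPPER = -1.16169, 2.92532              # non-link and link cutoffs from slide 19

def classify(r1, r2, fields=("ln", "fn", "dob")):
    pattern = tuple(int(r1[f] == r2[f]) for f in fields)   # agreement pattern
    t = T[pattern]
    if t >= UPPER:
        return pattern, "link"
    if t <= LOWER:
        return pattern, "non-link"
    return pattern, "clerical review"

print(classify({"ln": "Tyzzer", "fn": "John", "dob": "5/26/19xx"},
               {"ln": "Tyzzer", "fn": "Jeff", "dob": "5/26/19xx"}))   # (1, 0, 1) -> link
print(classify({"ln": "Smith", "fn": "Jeff", "dob": "5/26/19xx"},
               {"ln": "Tyzzer", "fn": "Jeff", "dob": "5/26/19xx"}))   # (0, 1, 1) -> clerical review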
23
Some Final Thoughts
!  To compute the agreement probabilities (slide 11), the expectation
maximization (EM) technique is usually employed. These probabilities drive
all subsequent results
!  The demonstrated scenario and examples are deliberately trivial
!  A more realistic situation would likely include more match columns and
several more possible configurations of them instead of simple agreement
or disagreement
!  A more realistic situation would also accommodate fuzzy matches
and incorporated value-specific frequencies into the probability
calculations. For last name, say, the agreement pattern would then be
interpreted as “the LN agrees and is <>,” e.g. “Smith”
!  To reduce the number of record-to-record comparisons from n(n-1)/2
(intrafile) or n*m (interfile) to something manageable, blocking (e.g. on zip
code or the phonetic encoding of the surname) is typically used -- see the
sketch below
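A minimal sketch of blocking on a single key; the records, field names, and the choice of zip code as the blocking key are all invented for illustration.

from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "ln": "Tyzzer", "zip": "95825"},
    {"id": 2, "ln": "Tyzer",  "zip": "95825"},
    {"id": 3, "ln": "Smith",  "zip": "90210"},
]

blocks = defaultdict(list)
for rec in records:
    blocks[rec["zip"]].append(rec)            # group records by the blocking key

candidate_pairs = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)        # compare only within a block
]
print(candidate_pairs)                        # [(1, 2)] -- record 3 is never compared

Only the intra-block pairs go on to the probabilistic comparison; production systems typically run several blocking passes with different keys so that pairs missed by one key can still be caught by another.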
24
References
!  Do, Chuong B., and Serafim Batzoglou. “What is the Expectation
Maximization Algorithm?” Nature Biotechnology 26.8 (2008): 897-899.
!  Fellegi, Ivan, and Alan B. Sunter. “A Theory for Record Linkage.”
Journal of the American Statistical Association 64.328 (1969):
1183-1210.
!  Herzog, Thomas N., Fritz J. Scheuren, and William E. Winkler. Data
Quality and Record Linkage Techniques. New York: Springer Science
+ Business Media, 2007.
!  ---. “Record Linkage.” WIREs Computational Statistics 2.5 (2010):
535-543.
!  Kirkendall, Nancy. “Weights in Computer Matching: Applications and
an Information Theoretic Point of View.” Record Linkage
Techniques--1985. Internal Revenue Service.
25
