Linguistic Considerations of Identity Resolution (2008)

GOVERNMENT USERS
Conference
“Navigating the Human Terrain”
College Park, MD, May 20-21, 2008
Linguistic Considerations of Identity Resolution
David Murgatroyd
Software Architect
Basis Technology

2
Outline
 Introduction
 Linguistic Challenges
 Variation (Intentional & Unintentional)
 Composition
 Frequency
 Under-specification
 Multilinguality
 Integration Challenges
 Inputs & Outputs
 Properties
 Evaluation Challenges
 Corpora: Find or Build?
 Metrics: Adopt or Create?
 Conclusion

3
Introduction: An Exercise
Jim Killeen
Kileen, J. D.
Jaime Kilin
‫كلين‬ ‫جمس‬
 Is there a >50% chance these refer to the same
person? If they are… US citizens; on a ferry to Spain;
in a documentary

4
What is Identity Resolution?
 Identity Resolution (aka Entity Resolution):
 determining if two or more given references refer to
the same entity.
 Different from name matching: it is about the
identity of entities, not the similarity of names
 See also:
 Murgatroyd, D. Some Linguistic Considerations of
Entity Resolution and Retrieval. In Proceedings of
LREC 2008 Workshop on Resources and Evaluation for
Identity Matching, Entity Resolution and Entity
Management.

5
What sorts of references?
 Non-linguistic reference examples:
 Numerical identifiers
— SSN
— Some portions of address (Street Number, Zip Code)
 Visual identifiers (e.g., pictures, symbols)
 Biometrics (e.g., DNA, iris, signature, voice)
 Linguistic reference examples:
 Nouns or pronouns in documents (e.g., “the CEO of Basis”)
 Names of associated/related entities
— Locations (e.g., Street or City Name)
— Organizations
— Individuals
 Name of entity <- we’re going to focus on this one

6
Let’s focus on names of people
 Common and familiar
 Often fairly identifying piece of personal
information
 Demonstrate typical challenges of resolution
with linguistic data

8
Variation (Intentional)
 Variation may be intentional
 References may draw on a large set of names:
— Formality (e.g., nicknames)
— Transparency (e.g., aliases)
— Location (e.g., toponym)
— Life status
 Vocation (e.g., titles)
 Marital status (e.g., marriage/divorce/widowhood)
 Parenthood (e.g., patronymic)
 Faith (e.g., christening, pilgrimage)
 Death (e.g., posthumous names)
— Dialect (e.g., adolescent girls preferring “Jenni” over “Jenny”)
— Style of text (e.g., “Sollermun” for “Solomon” in Huck Finn)
Example: Jim Killeen

9
Variation (Unintentional)
 Variation may be unintentional, arising from:
 Typos
— E.g., “Killeen” vs. “Kileen”
 Guessing spelling based on pronunciation
— E.g., “Caliin”
 Ambiguities inherent in the encoding (e.g., Unicode):
— Characters with the same glyph
 E.g., Latin and Cyrillic small “i”
— Characters with similar glyphs
 E.g., Latin “K” and Greenlandic “ĸ”
— Characters with composed/combined forms
 E.g., ņ (n with cedilla) vs. ņ (n + combining cedilla)
Example: Kileen, J. D.
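The encoding ambiguities listed on this slide can be inspected with Python's standard `unicodedata` module. The following is a minimal sketch, not part of the original talk; the helper name `describe` is invented for illustration.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# Same glyph, different characters: Latin "i" vs. Cyrillic "і" (U+0456).
latin_i, cyrillic_i = "i", "\u0456"
print(latin_i == cyrillic_i)  # False
print(describe(latin_i + cyrillic_i))

# Composed vs. combined forms of n with cedilla.
composed = "\u0146"    # one precomposed code point
combined = "n\u0327"   # "n" followed by a combining cedilla
print(composed == combined)  # False
print(unicodedata.normalize("NFC", combined) == composed)  # True

# Normalization merges composed/combined forms, but it does not
# merge look-alike characters from different scripts.
print(unicodedata.normalize("NFKC", cyrillic_i) == latin_i)  # False
```

Normalizing both references to the same form before comparison handles the composed/combined case. Look-alike characters across scripts need a separate strategy, such as a table of confusable characters.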

10
Composition
 Names have differing orders:
 Given v. Surname: “Killeen, Jim” v. “Jim Killeen”
 Varies by culture
 Name references may be partial:
 “Jim” v. “Jim Killeen”
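As a rough illustration of the ordering and partial-reference points above, the sketch below parses a reference into given name and surname and checks whether two references are compatible. It assumes Western conventions only ("Surname, Given" when a comma is present, otherwise surname last), which the slide notes do not hold across cultures. The function names `parse_name` and `compatible` are hypothetical.

```python
def parse_name(reference):
    """Split a reference into (given, surname).

    Assumption: "Surname, Given" when a comma is present; otherwise the
    last token is the surname. A single token is treated as a given name.
    """
    if "," in reference:
        surname, given = (part.strip() for part in reference.split(",", 1))
        return given, surname
    tokens = reference.split()
    if len(tokens) == 1:
        return tokens[0], ""
    return " ".join(tokens[:-1]), tokens[-1]

def compatible(ref_a, ref_b):
    """True if no component supplied by both references disagrees."""
    given_a, sur_a = parse_name(ref_a)
    given_b, sur_b = parse_name(ref_b)
    same_given = not given_a or not given_b or given_a.lower() == given_b.lower()
    same_sur = not sur_a or not sur_b or sur_a.lower() == sur_b.lower()
    return same_given and same_sur

print(parse_name("Killeen, Jim"))          # ('Jim', 'Killeen')
print(compatible("Killeen, Jim", "Jim Killeen"))
print(compatible("Jim", "Jim Killeen"))
```

Compatibility here only means the two references do not contradict each other. A partial reference such as "Jim" is compatible with many people, so it is weak evidence of identity.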

11
Under-specification
 Name components may be abbreviated
 Initials (e.g., “J. D.”)
 Abbreviations (e.g., “Jas.”)
 Name references may have incomplete…
 orthography (e.g., Semitic languages)
 segmentation (e.g., Asian languages)
 phonology (e.g., Ideographic languages)
Example: Kileen, J. D.
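One way to handle abbreviated components is to ask whether a short form could stand for a full form. The sketch below is an assumption-laden illustration: `could_abbreviate` is a hypothetical helper, and the one-entry `ABBREVIATIONS` table is only a placeholder for a real lookup resource.

```python
# Illustrative placeholder table; a real system would need a much larger one.
ABBREVIATIONS = {"jas": "james"}

def could_abbreviate(short, full):
    """True if `short` could be an abbreviated form of `full`.

    Covers dotted initials or truncations ("J." for "Jim") and
    table-driven abbreviations ("Jas." for "James").
    """
    short_key = short.rstrip(".").lower()
    full_key = full.lower()
    if not short_key:
        return False
    if ABBREVIATIONS.get(short_key) == full_key:
        return True
    # Only a trailing period marks the component as abbreviated.
    return short.endswith(".") and full_key.startswith(short_key)

print(could_abbreviate("J.", "Jim"))      # True
print(could_abbreviate("J.", "Jaime"))    # True
print(could_abbreviate("Jas.", "James"))  # True
print(could_abbreviate("D.", "Jim"))      # False
```

Note that "J." is consistent with both "Jim" and "Jaime": an under-specified component rules some candidates out, but it cannot by itself pick one.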

12
Frequency
 Any person can make up a name (an open class)
 A few are common, most are very uncommon
 Zipfian distribution
 Lesson:
 Valuable to know common names
 Valuable to have a strategy for unknown names
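One possible way to act on this lesson is to score names by how surprising they are, with smoothing so that unseen names still get a usable score. This is a sketch under stated assumptions: the counts below are invented toy numbers, not real frequency data, and `surprisal` is a hypothetical helper name.

```python
import math
from collections import Counter

# Toy counts invented for illustration only; not real census data.
surname_counts = Counter({"Smith": 900, "Klein": 90, "Killeen": 9})

def surprisal(name, counts):
    """Return -log2 of the add-one-smoothed relative frequency of `name`.

    Unseen names are treated as having count 0, so smoothing gives them
    a finite, high score instead of a division or lookup failure.
    """
    total = sum(counts.values()) + len(counts) + 1  # +1 slot for unseen names
    return -math.log2((counts[name] + 1) / total)

for name in ["Smith", "Klein", "Killeen", "Kilin"]:
    print(name, round(surprisal(name, surname_counts), 2))
```

Under this scheme, two references agreeing on a rare or unknown name would count as stronger evidence than agreeing on a very common one. Whether that weighting suits a given system is a design choice outside the slides.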

13
Multilinguality
 Names may appear in many languages-of-use
 This leads to variation at many linguistic levels.
 Orthographic:
 transliteration confronts skew in:
—orthographic-to-phonetic mappings of source and
target languages-of-use
—sound systems between the languages
‫كلين‬ ‫جمس‬ <-> James Klein

14
Multilinguality (cont’d)
 Syntactic:
 different languages-of-use may imply different name
word order
 Semantic:
 name words which communicate meaning (e.g.,
titles) may vary (e.g., “Jr.” for “‫الصغر‬”, which
means “the younger”)
 Pragmatic:
 different languages-of-use may use different names
based on the audience (e.g., “Mr. Laden” vs. “‫المير‬”
which means “the prince”)

16
Inputs & Outputs
 Input options include:
 Pair-wise: simple integration, but no effort shared across comparisons
 Set-based: harder integration, but able to optimize across the set
 Output options include:
 Feature-based: with weights/tuning
 Probability-based:
—more principled combination
—NOTE: similarity is not probability
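To make the distinction between similarity and probability concrete, the sketch below combines per-feature evidence into a probability, in the spirit of the record-linkage model of Fellegi and Sunter (1969) listed in the bibliography. It assumes features are conditionally independent. Every number in `FEATURES`, the `prior`, and the function name `match_probability` are invented for illustration.

```python
import math

# Hypothetical values, invented for illustration:
# m = P(feature agrees | same entity)
# u = P(feature agrees | different entities)
FEATURES = {
    "surname": (0.95, 0.01),
    "given_name": (0.90, 0.05),
    "citizenship": (0.99, 0.50),
}

def match_probability(agreements, prior=0.01):
    """Combine agreement/disagreement evidence into P(same entity).

    `agreements` maps feature name -> True (agrees) or False (disagrees).
    Assumes the features are conditionally independent.
    """
    log_odds = math.log(prior / (1 - prior))
    for feature, agrees in agreements.items():
        m, u = FEATURES[feature]
        if agrees:
            log_odds += math.log(m / u)
        else:
            log_odds += math.log((1 - m) / (1 - u))
    return 1 / (1 + math.exp(-log_odds))

print(match_probability({"surname": True}))
print(match_probability({"surname": True, "given_name": True, "citizenship": True}))
```

A string-similarity score of 0.9 says nothing about priors or about how common an agreement is among non-matches. The probability above depends on both, which is why the two quantities should not be used interchangeably.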

17
Integration Properties
 Certain properties help make efficient
implementations:
 Reflexivity:
—Resolve(a,a) is always true
—NOTE: does not imply Resolve(a,a’) where a~a’
 Commutativity:
—Resolve(a,b) <=> Resolve(b,a)
 Transitivity:
—Resolve(a,b) & Resolve(b,c) => Resolve(a,c)
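When all three properties hold, resolution is an equivalence relation, so references can be stored as disjoint clusters instead of as individual pairs. The union-find sketch below is one common way to do that. It is an illustration, not the implementation described in the talk, and the class name `ResolutionClusters` is hypothetical.

```python
class ResolutionClusters:
    """Union-find over references; each cluster is one resolved entity."""

    def __init__(self):
        self.parent = {}

    def find(self, ref):
        """Return the representative of the cluster containing `ref`."""
        self.parent.setdefault(ref, ref)
        while self.parent[ref] != ref:
            self.parent[ref] = self.parent[self.parent[ref]]  # path halving
            ref = self.parent[ref]
        return ref

    def link(self, a, b):
        """Record that `a` and `b` refer to the same entity."""
        self.parent[self.find(a)] = self.find(b)

    def resolve(self, a, b):
        """True if `a` and `b` are in the same cluster."""
        return self.find(a) == self.find(b)

clusters = ResolutionClusters()
clusters.link("Jim Killeen", "Kileen, J. D.")
clusters.link("Kileen, J. D.", "Jim")
print(clusters.resolve("Jim", "Jim Killeen"))          # True, by transitivity
print(clusters.resolve("Jim Killeen", "Jim"))          # True, by commutativity
print(clusters.resolve("Jaime Kilin", "Jaime Kilin"))  # True, by reflexivity
print(clusters.resolve("Jaime Kilin", "Jim"))          # False, never linked
```

The cost of this efficiency is that one wrong link merges two whole clusters, so enforcing transitivity is only appropriate when the underlying pairwise decisions are reliable.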

19
Corpora: Find or Build?
 Requirements:
 Annotated for ground truth
 Represent linguistic challenges
 Scalable/practical
 Options
 Adapt public “database” corpora:
— Wikipedia:
 Annotated: yes
 Representative: somewhat
 Scalable: yes
— Citation DBs:
 Annotated: no
 Representative: somewhat
 Scalable: yes

20
Corpora: Find or Build? (cont’d)
 Adapt public “document” corpora:
— Co-reference documents:
 Annotated: yes
 Representative: less so, as often limited to a single document or language-of-use
 Scalable: yes
 Create corpora by hand:
— From scratch: “parrot sessions” (auditory or visual)
 Annotated: yes
 Representative: largely
 Scalable: no
— From un-annotated databases:
 Annotated: no
 Representative: yes
 Scalable/practical: no; databases may be private
— Synthesize from generative model
 Annotated: yes
 Representative: no, tied to generating model
 Scalable: yes

21
Metrics
 Back to our initial example
[Figure: three link graphs (Reference, System A, System B) over the references Jim Killeen; Kileen, J. D.; Jaime Kilin; Jim; JDK; JimK illeen; J. Diw Killeen]

22
Metrics: Adopt or Create?
 How to quantify the quality of the system’s resolutions
vs. the reference?
 Goals:
 Discriminative: separates good v. bad systems for users’ needs
 Interpretable: number aligns with intuition
 Considerations:
 Assume transitive closure (TC) of output?
 Apply weights to try to be more discriminative?
 Common concepts:
 Precision: % of stuff in the answer that’s right
 Recall: % of the right stuff that’s in the answer
 F-Score: Harmonic mean of these = 2*P*R/(P+R)

23
Candidate Metrics
 Pair-wise % correct: over all N*(N-1)/2 node pairs
 Pair-wise P&R: based on links drawn
 Edit-distance: # of links to add/subtract to correct
 Metrics used in document co-reference resolution:
 MUC-6: entity-based P&R on missing links from graph
 B-CUBED: average per-reference P&R of links
 CEAF (Constrained Entity-Alignment F): entities aligned
using some similarity measure; P&R are % of possible
similarity level achieved
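Two of these candidates are easy to sketch on clusterings: pair-wise P&R over links (with transitive closure assumed, since every within-cluster pair counts as a link) and B-CUBED per-reference P&R. The code below is an illustrative sketch on a made-up four-reference example, not the evaluation code behind the talk. The function names and the labels "a" through "d" are hypothetical.

```python
from itertools import combinations

def links(clusters):
    """All unordered within-cluster pairs (links under transitive closure)."""
    return {frozenset(pair) for c in clusters for pair in combinations(sorted(c), 2)}

def pairwise_prf(system, reference):
    """Precision, recall, and F-score over links drawn."""
    sys_links, ref_links = links(system), links(reference)
    correct = len(sys_links & ref_links)
    p = correct / len(sys_links) if sys_links else 1.0
    r = correct / len(ref_links) if ref_links else 1.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def b_cubed(system, reference):
    """Average per-reference precision and recall of cluster membership."""
    sys_of = {ref: c for c in system for ref in c}
    ref_of = {ref: c for c in reference for ref in c}
    n = len(ref_of)
    p = sum(len(sys_of[x] & ref_of[x]) / len(sys_of[x]) for x in ref_of) / n
    r = sum(len(sys_of[x] & ref_of[x]) / len(ref_of[x]) for x in ref_of) / n
    return p, r

# Made-up example: the reference groups a, b, c together; the system splits c off.
reference = [{"a", "b", "c"}, {"d"}]
system = [{"a", "b"}, {"c", "d"}]
print(pairwise_prf(system, reference))
print(b_cubed(system, reference))
```

On this example the pair-wise metric sees one correct link out of two drawn and three expected, while B-CUBED gives partial credit to every reference. That is one reason the two can rank systems differently.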

24
Comparing Metrics
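[Figure: Reference, System A, and System B link graphs over the references Jim Killeen; Kileen, J. D.; Jaime Kilin; Jim; JDK; JimK illeen; J. Diw Killeen]
Metrics compared (one is marked “My preference”): % Correct (No TC, TC); Pairwise F (No TC, TC); MUC-6 (TC); B-CUBED (TC); CEAF (TC); Edit-dist (No TC, TC)
Scores as listed (column alignment not recoverable):
System A: 90 78 80 62 61 82 79
System B: 81 85 89 73 71 79 82
Edit-dist: 3 6 1 4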

25
Conclusion
 Identity resolution systems face linguistic
challenges
 They need to be carefully integrated to meet
these challenges
 Evaluation corpora should reflect these
challenges
 Evaluation metrics should align with qualitative
judgements

26
Bibliography
Bagga, A., Baldwin, B. (1998). Algorithms for scoring coreference chains. In
Proceedings of the First International Conference on Language Resources
and Evaluation Workshop on Linguistic Coreference.
Fellegi, I. P., Sunter, A. B. (1969). A theory for record linkage. Journal of the
American Statistical Association, Vol. 64, No. 328, pp. 1183--1210.
Luo, X. (2005). On coreference resolution performance metrics. In Proc. of
HLT-EMNLP, pp 25--32.
Menestrina, D., Benjelloun, O., Garcia-Molina, H. (2006). Generic entity
resolution with data confidences. In First International VLDB Workshop on
Clean Databases. Seoul, Korea.
Murgatroyd, D. (2008). Some Linguistic Considerations of Entity Resolution and
Retrieval. In Proceedings of LREC 2008 Workshop on Resources and
Evaluation for Identity Matching, Entity Resolution and Entity
Management.
Spock Team (2008). The Spock Challenge. http://challenge.spock.com/
(Retrieved February 5.)
Vilain, M., Burger, J., Aberdeen, J., Connolly, D., Hirschman, L. (1995). A
model-theoretic coreference scoring scheme. In Proceedings of the 6th
Message Understanding Conference (MUC6). Morgan Kaufmann, pp. 45--52.

27
Questions?
More information:
http://www.basistech.com
