WHAT’S IN A NAME?
PHONETIC ALGORITHMS
FOR SEARCH AND SIMILARITY
Mercedes Coyle
@benzobot
Data Infrastructure Engineer
WHAT I’M GOING TO COVER TODAY
• Search - how does it work?
• Phonetic Algorithms
• Use cases for Phonetic Algorithms
WHEN WE THINK OF SEARCH…
HOW DOES GOOGLE SEARCH WORK?
• Web crawling on a very large scale!
• Document rank (importance) and similarity
• Text analysis
image credit: flickr.com/photos/rserrano/
HOW DOES GOOGLE SEARCH WORK?
• Obligatory hand-wavy “Big Data” comment here
image credit: twitter.com/wtrsld/status/424364245648564226
DATABASE SEARCH
image credit: Mercedes Coyle
*SQL
• Comparison search: LIKE operator
• SELECT * FROM table WHERE word LIKE '%and%'
• basically a wildcard character search
• only returns data that contains the search string; does not account for misspelling
• can be expensive on large datasets, since a pattern with a leading wildcard generally cannot use an ordinary index
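The LIKE behaviour described above can be tried out with a small sketch. It uses Python's built-in sqlite3 module and an in-memory table; the table name `words`, the sample rows, and the helper `like_search` are assumptions made for illustration, not part of the talk.

```python
import sqlite3


def like_search(conn: sqlite3.Connection, term: str) -> list[str]:
    """Return every stored word that contains `term` as a substring."""
    rows = conn.execute(
        "SELECT word FROM words WHERE word LIKE ? ORDER BY word",
        (f"%{term}%",),  # wildcard on both sides, as in LIKE '%and%'
    )
    return [row[0] for row in rows]


# Hypothetical sample data held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [("sand",), ("band",), ("Anderson",), ("Shawn",)],
)

print(like_search(conn, "and"))   # substring matches only
print(like_search(conn, "Sean"))  # the stored spelling "Shawn" is not found
```

Searching for “and” finds every word containing that string, while searching for “Sean” finds nothing even though “Shawn” is stored, which is the misspelling problem the slide points at.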
ELASTICSEARCH - TOKENIZATION
• Used in full-text search against a corpus of text
• “The quick brown fox jumped over the lazy dog”
• the, quick, brown, fox, jump, over, lazy, dog
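A rough sketch of how the sentence above could turn into that list of terms. This is not how Elasticsearch is implemented: the suffix rule is a deliberately naive stand-in for a real stemmer, and the function name `tokenize` is an assumption for illustration.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text, split it into words, crudely stem, keep unique terms."""
    terms: list[str] = []
    for word in re.findall(r"[a-z]+", text.lower()):
        # Naive stand-in for stemming: strip a trailing "ed" from longer words.
        if word.endswith("ed") and len(word) > 4:
            word = word[:-2]
        # An index stores each distinct term once, so skip repeats.
        if word not in terms:
            terms.append(word)
    return terms


print(tokenize("The quick brown fox jumped over the lazy dog"))
```

Each term is then matched exactly at search time, which is why a query for one spelling of a name does not find another.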
PROBLEM: TEXT-BASED SEARCHES DON’T ALWAYS WORK WELL WITH NAMES
• Wildcard searches return too many results
• Typos or misspelled names don’t return correct results
• e.g. “Shawn” vs “Sean”
WHAT IS A PHONEME?
• In language, the smallest unit of sound that distinguishes one word from another
• Covers vowel and consonant sounds, which may be spelled with single letters or letter combinations
ENGLISH PHONEMES
HOW DO WE TRANSLATE PHONEMES TO CODE?
image credit: demoons.com/2010/09/first-animation-test.html
PHONETIC ALGORITHMS
• A method of hashing words and names based on sounds (phonemes).
PHONETIC ALGORITHM TYPES
• Soundex
• NYSIIS
• Metaphone and Double Metaphone
• Match Rating, Daitch-Mokotoff Soundex, Kölner Phonetik, Caverphone…
SOUNDEX
• Patented in 1918 and later used to index names in US Census records
• Built in to MySQL; available in PostgreSQL through the fuzzystrmatch extension
SOUNDEX ALGORITHM
Mercedes = MERCEDES
MERCEDES = M0620302
{ 0 : ['A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'], 1 : ['B', 'F', 'P', 'V'], 2 : ['C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z'], 3 : ['D', 'T'], 4 : ['L'], 5 : ['M', 'N'], 6 : ['R'] }
M0620302 = M6232
M6232 = M623
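The steps above can be sketched in a few lines of Python. This follows the simplified table on the slide, which treats H and W like vowels; classic Soundex handles those two letters slightly differently, so a few names may encode differently than in a database's built-in function. The names `GROUPS`, `LETTER_TO_DIGIT` and `soundex` are assumptions for illustration.

```python
# Letter groups from the slide, keyed by their Soundex digit.
GROUPS = {
    "0": "AEIOUHWY",
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}
LETTER_TO_DIGIT = {ch: digit for digit, letters in GROUPS.items() for ch in letters}


def soundex(name: str) -> str:
    """Encode a name as its first letter followed by three digits."""
    letters = [ch for ch in name.upper() if ch in LETTER_TO_DIGIT]
    if not letters:
        return ""
    digits = [LETTER_TO_DIGIT[ch] for ch in letters]
    # Collapse runs of the same digit so adjacent similar sounds count once.
    collapsed = [digits[0]]
    for digit in digits[1:]:
        if digit != collapsed[-1]:
            collapsed.append(digit)
    # Keep the first letter itself, then drop the vowel placeholders (0).
    tail = [digit for digit in collapsed[1:] if digit != "0"]
    # Pad with zeros or truncate so the code is always four characters.
    return (letters[0] + "".join(tail) + "000")[:4]


print(soundex("Mercedes"))                # M623, as in the walkthrough above
print(soundex("Shawn"), soundex("Sean"))  # both names share one code
print(soundex("Cathy"), soundex("Kathy")) # codes differ only in the first letter
```

The last line shows the cost of keeping the first letter: two names that sound the same get different codes when they start with different letters.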
SOUNDEX LIMITATIONS
• Most implementations work for the English language only
• First letter retention causes no match on some similar names (e.g. “Cathy” and “Kathy”)
• Postgres Soundex implementation has limited character encoding support
http://www.postgresql.org/docs/9.4/static/fuzzystrmatch.html
NYSIIS
• Developed in 1970, part of the New York State Identification and Intelligence System
• Slightly improved matching accuracy over Soundex
NYSIIS ALGORITHM
• MERCEDES
• MARCADAS
• MARCADA
• MARCAD
NYSIIS
METAPHONE
• Developed in 1990 by Lawrence Philips
• Improved accuracy over Soundex and NYSIIS
• Double Metaphone produces two hashes (a primary and an alternate) for each name or word
METAPHONE
• Metaphone and Double Metaphone were improved
upon in Metaphone 3, which is unfortunately closed
source.
PHONETIC ALGORITHMS IN PRACTICE
• Use cases for Phonetic Algorithms
• Example uses in Databases
PHONETIC ALGORITHMS IN PRACTICE
• Phonetic algorithms are useful for searching by name or word, and tolerate some misspelling.
PHONETIC ALGORITHMS IN PRACTICE
• Store the phonetic hash of a name in fields/columns in your db for indexing and querying
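A minimal sketch of that storage pattern, assuming an in-memory SQLite table and a simplified Soundex hash rather than the database and encoder used in the talk. The table `reports`, the column `s_last_name`, and the helpers `add_report` and `find_by_last_name` are hypothetical names for illustration.

```python
import sqlite3

# Simplified Soundex table (H and W treated like vowels), an assumption here.
GROUPS = {"AEIOUHWY": "0", "BFPV": "1", "CGJKQSXZ": "2",
          "DT": "3", "L": "4", "MN": "5", "R": "6"}
CODE = {ch: digit for letters, digit in GROUPS.items() for ch in letters}


def soundex(name: str) -> str:
    """First letter plus three digits; adjacent repeats and vowels are skipped."""
    letters = [ch for ch in name.upper() if ch in CODE]
    if not letters:
        return ""
    out, prev = letters[0], CODE[letters[0]]
    for ch in letters[1:]:
        digit = CODE[ch]
        if digit != prev and digit != "0":
            out += digit
        prev = digit
    return (out + "000")[:4]


def add_report(conn: sqlite3.Connection, first: str, last: str) -> None:
    """Store the name together with the phonetic hash of the last name."""
    conn.execute(
        "INSERT INTO reports (first_name, last_name, s_last_name) VALUES (?, ?, ?)",
        (first, last, soundex(last)),
    )


def find_by_last_name(conn: sqlite3.Connection, spelling: str) -> list[tuple[str, str]]:
    """Hash the search term the same way and compare hashes, not spellings."""
    rows = conn.execute(
        "SELECT first_name, last_name FROM reports WHERE s_last_name = ?",
        (soundex(spelling),),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reports (first_name TEXT, last_name TEXT, s_last_name TEXT)")
conn.execute("CREATE INDEX idx_reports_s_last_name ON reports (s_last_name)")
add_report(conn, "Arya", "Stark")
add_report(conn, "Jon", "Snow")

print(find_by_last_name(conn, "Starck"))  # the misspelled query still finds Stark
```

Because the hash is computed once at write time and indexed, the lookup is an exact match on a short code instead of a wildcard scan.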
{ "_id" : ObjectId("53e13a73cbcc7a0a6e3078e5"),
"first_name" : "Arya", "last_name" : "Stark",
"n_first_name" : "AR", "n_last_name" : "STARC",
"report" : "lost_item", "item" : "ID Card",
"timestamp" : 1407269491, "report_id" : 50642 }
PHONETIC SEARCH WITH ELASTICSEARCH
• Elasticsearch supports phonetic matching through its phonetic analysis plugin, in many different languages!
• Store words/names as documents; the analyzer hashes them at index time and hashes the search terms at query time
GET /my_index/_analyze?analyzer=dbl_metaphone
text to analyze: Smith Smythe
returns the same tokens for both names: SM0 and XMT
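For the request above to work, the index needs an analyzer named `dbl_metaphone`. A sketch of the index settings that would define one, written as a Python dict and printed as JSON. It assumes the phonetic analysis plugin is installed; the function name `phonetic_index_settings` is hypothetical, and the body would be sent when creating the index (for example with PUT /my_index).

```python
import json


def phonetic_index_settings(encoder: str = "double_metaphone") -> dict:
    """Build index settings with a phonetic token filter and an analyzer using it."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Token filter provided by the phonetic analysis plugin.
                    "dbl_metaphone": {"type": "phonetic", "encoder": encoder}
                },
                "analyzer": {
                    # Analyzer name matches the one used in the _analyze request.
                    "dbl_metaphone": {
                        "tokenizer": "standard",
                        "filter": ["dbl_metaphone"],
                    }
                },
            }
        }
    }


print(json.dumps(phonetic_index_settings(), indent=2))
```

Swapping the `encoder` value is how a different phonetic algorithm would be selected, subject to what the installed plugin version supports.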
PHONETIC SEARCH USING ELASTICSEARCH
• As a Developer, I really like using Elasticsearch!
• But as a System Administrator, I have battle scars.
PHONETIC ALGORITHMS FOR NON-ENGLISH LANGUAGES
Grab a linguist and write one?
image credit: flickr.com/photos/opacity
RESOURCES
• Libraries
• clj-fuzzy: yomguithereal.github.io/clj-fuzzy/
• python soundex: pypi.python.org/pypi/soundex/1.1.3
• python fuzzy: pypi.python.org/pypi/Fuzzy
• elasticsearch phonetic matching: https://www.elastic.co/guide/en/elasticsearch/guide/current/phonetic-matching.html
• http://aspell.net/metaphone/dmetaph.cpp
• Reading:
• http://doughellmann.com/2012/03/03/using-fuzzy-matching-to-search-by-sound-with-python.html
• Fluency, Jennifer Foehner Wells - http://www.jenniferfoehnerwells.com/fluency.html
THANKS FOR LISTENING!
QUESTIONS?
Mercedes Coyle
@benzobot
image credit: Mercedes Coyle


Phonetic algorithms os_bridge_2015
