Charlie Hull - Managing Director, Flax
Harry Waye – CTO
Adam Schakaki - Developer, Arachnys
charlie@flax.co.uk
www.flax.co.uk/blog
+44 (0) 8700 118334
@FlaxSearch
@arachnys
Finding the Bad Actor:
Custom scoring & forensic name matching
with Elasticsearch
• We build, tune and support fast, accurate and highly scalable
search, analytics and Big Data applications
• We use (and create) open source software
• We're independent, honest and have 16+ years' experience
• We also:
– Run and attend many events & conferences
– Write extensively about search & related matters
– Train and mentor
Arachnys are an investigations technology company based in
London & NY. They employ robotic process automation to
enhance investigations and automate manual work.
Arachnys is a technology company that simplifies the process
of conducting enhanced due diligence. Our clients use us to
manage their regulatory and reputation based risk in-house,
going far above and beyond basic sanctions and PEP list
screening.
Arachnys use a 23,000+ source library of risk-relevant
information, searching in 97 languages and in 120+ countries. It
is the industry leader in emerging markets.
Who are Arachnys?
• A two-part project:
1. Name matching in free text - an Elasticsearch plugin to find names & adverse terms
2. Replicating Relevancy - replicate with Elasticsearch how a third-party service scores name matches
• Complicating factors:
1. Scores need to be “comparable” between searches
2. The client had to understand how this all worked
...so we're going to try and explain it to you!
HW
But first... a short demonstration...
CH
Name matching in free text
• An example:
– Query for: “William Gates” near “money launder”
– Match: “Bill Henry Gates accused of money laundering”
– Do not match: “Bills gates linked to money launder accused”
• An example (2)
– Query string approximation:
“William Gates”~2 AND “money launder”
– Problems:
■ We want proximity both between the words in “William Gates” and between the phrases “William Gates” and “money launder”
■ We want fine-grained control of scores
■ Lots of analyzer subtleties: plurals, stemming, middle names, typos
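As a rough illustration of the query string approximation, the search body could be assembled as below. This is only a sketch: the field name `body_text` and the helper name `query_string_body` are assumptions, not taken from the talk.

```python
import json


def query_string_body(name, adverse, name_slop, field):
    """Build a query_string search body approximating the slide's query.

    `field` is a hypothetical field name; the talk does not name one.
    """
    # Phrase with slop for the name, plain phrase for the adverse term.
    query = f'"{name}"~{name_slop} AND "{adverse}"'
    return {"query": {"query_string": {"query": query, "default_field": field}}}


body = query_string_body("William Gates", "money launder", 2, "body_text")
print(json.dumps(body, indent=2))
```

Note that `AND` only requires both phrases to appear somewhere in the same document. It says nothing about how close they are to each other, which is the first problem listed on the slide.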
• The solution - DPeso algorithm
– SpanQuery to find a name
– SpanQuery to find one of a set of adverse terms
– Score documents by:
• Frequency & position of matches of both SpanQueries
• Minimum distance between a match for each SpanQuery
• Name exactness: synonyms, name permutations, middle names
Hang on, what's a SpanQuery?
SpanNearQuery spanNear =
    new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term(FIELD, "money")),
        new SpanTermQuery(new Term(FIELD, "laundering"))},
        4,      // slop distance
        true);  // true if order is important
This matches “money laundering” but not “laundering business makes money”.
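To make the slop and ordering behaviour concrete, here is a deliberately simplified two-term model in Python. It is not Lucene's implementation: it assumes whitespace tokenisation and treats the gap as the number of token positions strictly between the two terms.

```python
def span_near_match(tokens, first, second, slop, in_order):
    """Return True if `first` and `second` occur within `slop` positions.

    Simplified two-term model of a span-near check, not Lucene's code.
    """
    first_positions = [i for i, tok in enumerate(tokens) if tok == first]
    second_positions = [i for i, tok in enumerate(tokens) if tok == second]
    for p in first_positions:
        for q in second_positions:
            if p == q:
                continue
            if in_order and q < p:
                continue  # `second` must come after `first`
            if abs(q - p) - 1 <= slop:
                return True
    return False


doc_a = "money laundering".split()
doc_b = "laundering business makes money".split()

print(span_near_match(doc_a, "money", "laundering", slop=4, in_order=True))   # True
print(span_near_match(doc_b, "money", "laundering", slop=4, in_order=True))   # False
print(span_near_match(doc_b, "money", "laundering", slop=4, in_order=False))  # True
```

The second document only fails because of the ordering requirement; with `in_order=False` the two terms are well within the slop of 4.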
In Elasticsearch:
{
  "query": {
    "span_near": {
      "clauses": [
        { "span_term": { "field": "money" } },
        { "span_term": { "field": "laundering" } }
      ],
      "slop": 4,
      "in_order": true
    }
  }
}
– Note that there are other types of SpanQuery, and you can combine them in various ways, as you might expect
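The same request body can be generated programmatically. The sketch below assumes a hypothetical field name `body_text` and a helper name of our own choosing; only the `span_near` / `span_term` structure comes from the slide.

```python
import json


def span_near_query(field, terms, slop, in_order=True):
    """Build a span_near search body with one span_term clause per term."""
    clauses = [{"span_term": {field: term}} for term in terms]
    return {
        "query": {
            "span_near": {"clauses": clauses, "slop": slop, "in_order": in_order}
        }
    }


# "body_text" is a hypothetical field name used for illustration.
body = span_near_query("body_text", ["money", "laundering"], slop=4)
print(json.dumps(body, indent=2))
```

Because `span_term` is a term-level query, the terms are not analysed at query time, so they need to match the tokens the analyzer produced at index time (for example lowercased or stemmed forms).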
• But... we needed to extend SpanQueries
– Standard Lucene SpanQueries score using basic frequency
– But we care where a match appears
– We also care whether it's the actual word or a synonym that matches
• So we extended SpanQuery scoring
– Earlier hits count for more
– Synonym matches count for less
• Implemented twice, for Lucene 4 and Lucene 6, due to Span API changes (Alan’s fault)
• Created a new Elasticsearch query
• entity_search_query combines two (new, improved) SpanQueries
• Scoring is additive: the more adverse terms found near the name, the higher the score
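The scoring ideas described so far (earlier hits count for more, synonym matches count for less, and scores add up across matches) can be sketched as follows. The linear position decay and the `synonym_boost` default of 0.5 are assumptions made for illustration; the talk does not give the plugin's actual formula.

```python
def match_weight(position, doc_length, is_synonym, synonym_boost=0.5):
    """Weight a single match: earlier is better, synonyms count for less."""
    # Hypothetical linear decay: 1.0 at the first token, approaching 0 at the end.
    position_weight = 1.0 - position / doc_length
    synonym_weight = synonym_boost if is_synonym else 1.0
    return position_weight * synonym_weight


def document_score(matches, doc_length, synonym_boost=0.5):
    """Additive score over (position, is_synonym) pairs."""
    return sum(
        match_weight(pos, doc_length, syn, synonym_boost) for pos, syn in matches
    )


# One exact match at the start, one synonym match halfway through.
matches = [(0, False), (50, True)]
print(document_score(matches, doc_length=100))  # 1.25
```

Because none of the inputs depend on corpus-wide statistics, a score computed this way means the same thing from one search to the next, which is the "comparable" property the project needed.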
• entity_search query
{
  "query": {
    "entity_search": {
      "name": "Mr Benn",
      "adverse_terms": ["shopkeeper", "hat", "costume shop"],
      "fields",
      "analyzers",
      "fuzziness",
      "slop",           << slop distance
      "boost",          << default is 1
      "synonym_boost",  << default is less than 1, to score synonyms lower
      "order_bias"      << if 0, only matches in the given order count; otherwise used to score out-of-order matches lower
    }
  }
}
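A small client-side helper can guard against mistyped parameter names when building this query. The sketch below takes the parameter names from the slide, but the value types and the example values for `slop` and `order_bias` are assumptions, since the slide does not show them.

```python
# Optional parameter names as listed on the slide.
KNOWN_OPTIONS = {
    "fields", "analyzers", "fuzziness", "slop",
    "boost", "synonym_boost", "order_bias",
}


def entity_search_body(name, adverse_terms, **options):
    """Build an entity_search request body, rejecting unknown option names."""
    unknown = set(options) - KNOWN_OPTIONS
    if unknown:
        raise ValueError(f"unknown entity_search options: {sorted(unknown)}")
    query = {"name": name, "adverse_terms": list(adverse_terms)}
    query.update(options)
    return {"query": {"entity_search": query}}


# The slop and order_bias values here are purely illustrative.
body = entity_search_body(
    "Mr Benn", ["shopkeeper", "hat", "costume shop"], slop=5, order_bias=0
)
print(body)
```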
HW
● Why does it need to be explainable?
○ Finance organisations need to explain to regulators why they may or may not have considered a particular hit
○ The data volume across all searches needs to be limited in a way that is defensible
● Why doesn’t standard scoring cut it?
○ It depends on field statistics for the entire dataset, so it is not transparent or obvious
○ It does not take “risk” into account, i.e. the severity of a particular term, only its frequency
○ It does not allow a “global” threshold to be set
CH
Replicating Relevancy
HW
● Arachnys' client required a high-volume monitoring solution
○ Based on a third-party data set and name matching logic
○ Name matches scored from 0-100%
○ Supports:
■ Synonyms (William, Bill, Wilhelm etc.)
■ Typos
■ Initial matches
■ Missing spaces, amongst other edge cases
● Arachnys wanted to implement the third party's relevancy calculations in Elasticsearch
○ People said it couldn't be done
○ People said it shouldn’t be done
• Custom indexing
• Custom query parser
• A proxy for Elasticsearch
• Searchkit GUI www.searchkit.co
• Custom indexing
– Source data is XML in Accuity format
– Indexer written in Python
• Normalises dates, addresses, IDs
• Sorts out mappings
• Pushes data to Elasticsearch
– Extensive multi-field usage to index:
• Initials
• Stripped space
• Word counts
• Phonetics
• Synonyms
• ICU normalization
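To show what some of these sub-fields might contain, here is a sketch that derives a few of the variants for a single name in plain Python. It is an illustration only: in the real system the variants would presumably be produced by Elasticsearch analyzers on multi-fields, `NFKC` plus `casefold` is just a stand-in for ICU normalization, and phonetics and synonyms are left out because they need extra libraries or data.

```python
import unicodedata


def name_variants(name):
    """Derive illustrative index variants for one name."""
    # Stand-in for ICU normalization: Unicode NFKC plus case folding.
    normalised = unicodedata.normalize("NFKC", name).casefold()
    words = normalised.split()
    return {
        "original": name,
        "normalised": " ".join(words),
        "initials": "".join(word[0] for word in words),
        "stripped_space": "".join(words),
        "word_count": len(words),
    }


variants = name_variants("David John  Stevens")
print(variants)
```

Indexing the stripped-space and initials forms alongside the original is what lets a query such as "DavidStevens" or "D J Stevens" still find the record.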
• Custom query parser written in Python:
– Split the query into terms and search for them individually
– Also do a phrase search
– Features such as 'ignore initials'
– Scoring:
• If terms are on the 'uninteresting' list, lower the score
– e.g. “PLC”, “ltd”
• Normalise to the range 0 to 1 using a function query
• An optional minimum score can be passed in
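The parsing steps above could look roughly like the sketch below. Several details are assumptions: splitting on whitespace, the field name `name`, the `low_boost` value of 0.2, and the use of `match` / `match_phrase` clauses inside a `bool` query. The normalisation to the 0 to 1 range is not shown, because the slides do not say which function was used.

```python
# Terms that should contribute less to the score (examples from the slide).
UNINTERESTING = {"plc", "ltd"}


def parse_name_query(text, field="name", low_boost=0.2, min_score=None):
    """Turn a name string into a bool query: per-term matches plus a phrase."""
    should = []
    for term in text.split():  # assumption: terms are whitespace-separated
        is_dull = term.lower().strip(".") in UNINTERESTING
        boost = low_boost if is_dull else 1.0
        should.append({"match": {field: {"query": term, "boost": boost}}})
    # Also search for the whole input as a phrase.
    should.append({"match_phrase": {field: {"query": text}}})
    body = {"query": {"bool": {"should": should}}}
    if min_score is not None:
        body["min_score"] = min_score  # drop hits scoring below the threshold
    return body


body = parse_name_query("Arachnys Entertainment Ltd", min_score=0.5)
print(body)
```

A threshold passed as `min_score` is only meaningful across searches once scores have been normalised, which is why the normalisation step matters for a "global" cut-off.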
• A proxy for Elasticsearch
– Uses the Flask web service framework
– Endpoints:
• /healthcheck << run an arbitrary query; if it returns results then OK
• /<indexname>/_debug << show what the parser returns
• /<indexname>/_search << parse the query and do a search
• /<indexname>/_explain << do a search with 'explain' on
• /<indexname>/<doc_id> << return a document
• A wrinkle – Korean name searching
– Korean names can be written in Hangul
– Names written in Hangul typically contain no whitespace
– This breaks the ICU tokeniser
• Solution:
– Check the input string for Hangul
– Use n-grams
• Works better than the system we're replicating!
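The two steps of the Hangul workaround can be sketched in a few lines of Python. The Unicode block ranges are standard, but the n-gram size of 2 is an assumption; the talk does not say which size was used.

```python
def contains_hangul(text):
    """Return True if any character falls in a Hangul Unicode block."""
    for ch in text:
        cp = ord(ch)
        if (
            0xAC00 <= cp <= 0xD7A3     # Hangul Syllables
            or 0x1100 <= cp <= 0x11FF  # Hangul Jamo
            or 0x3130 <= cp <= 0x318F  # Hangul Compatibility Jamo
        ):
            return True
    return False


def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams, ignoring whitespace."""
    chars = "".join(text.split())
    if len(chars) < n:
        return [chars] if chars else []
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]


name = "\uae40\uc815\uc740"  # a three-syllable Korean name written in Hangul
if contains_hangul(name):
    print(char_ngrams(name))  # two overlapping bigrams
```

Overlapping n-grams mean a match no longer depends on the tokeniser finding word boundaries that the input does not contain.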
HW
• Examples
○ STEVENS, DAVID
■ David Stevens should be 100%
■ David John Stevens should be in the 90s
■ Dave John Stevens should be less than that due to synonyms
■ David Johnston shouldn’t match
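Expectations like these can be written down as executable checks against whatever scoring function is being tuned. The harness below is hypothetical: it assumes a `score(query, candidate)` callable that returns a percentage from 0 to 100, or `None` when there is no match, and the stub scores are invented purely to exercise the checks.

```python
def check_name_expectations(score):
    """Check a scorer against the STEVENS, DAVID expectations.

    Returns a list of human-readable failures (empty if all checks pass).
    """
    query = "STEVENS, DAVID"
    exact = score(query, "David Stevens")
    middle = score(query, "David John Stevens")
    synonym = score(query, "Dave John Stevens")
    other = score(query, "David Johnston")

    failures = []
    if exact != 100:
        failures.append("David Stevens should score 100")
    if middle is None or not (90 <= middle < 100):
        failures.append("David John Stevens should score in the 90s")
    if synonym is None or middle is None or not synonym < middle:
        failures.append("Dave John Stevens should score below David John Stevens")
    if other is not None:
        failures.append("David Johnston should not match")
    return failures


# Invented stub scores, used only to demonstrate the harness.
stub_scores = {
    "David Stevens": 100,
    "David John Stevens": 94,
    "Dave John Stevens": 88,
    "David Johnston": None,
}
failures = check_name_expectations(lambda query, candidate: stub_scores[candidate])
print(failures)  # []
```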
• Examples (2)
○ Arachnys Entertainment & Resorts Co
■ Arachnys Entertainment & Resorts Co Ltd should be 100%
■ Arachnys Entertainment & Resorts should be very close to 100%
■ Arachnys should be over 50% due to brand recognition
■ Arachnys Information Systems below 50% due to mismatching terms
• Tested using Quepid www.quepid.com
The future...
● Name matching in free text:
○ Entity extraction (or basic capitalized phrase extraction) to better support more advanced name matching: it’s difficult to do anything nicer without incurring a lot of noise
● Replicating relevancy:
○ Ensuring Elasticsearch handles analysis; client-side analysis is a minefield
○ Generalizing to a self-contained name analyzer
CH
Thank you!
Any questions?
charlie@flax.co.uk
www.flax.co.uk/blog
+44 (0) 8700 118334
Twitter: @FlaxSearch
Arachnys is hiring for:
● UX Designer
● Front-end Developer
● Senior Software Engineer
Apply at https://arachnys.workable.com/
@arachnys
AS

Editor's Notes

  • #7: Does anyone have any idea what I'm on about? If you want to find out, look up 'siteswap notation' later