Finding the Bad Actor: Custom scoring & forensic name matching with Elasticsearch

Charlie Hull - Managing Director, Flax
Harry Waye – CTO
Adam Schakaki - Developer, Arachnys
charlie@flax.co.uk
www.flax.co.uk/blog
+44 (0) 8700 118334
@FlaxSearch
@arachnys

• We build, tune and support fast, accurate and highly scalable
search, analytics and Big Data applications
• We use (and create) open source software
• We're independent, honest and have 16+ years' experience
• We also:
– Run and attend many events & conferences
– Write extensively about search & related matters
– Train and mentor

Who are Arachnys?
Arachnys is an investigations technology company based in
London & NY. It uses robotic process automation to
enhance investigations and automate manual work.
Arachnys simplifies the process of conducting enhanced due
diligence. Its clients use it to manage their regulatory and
reputation-based risk in-house, going far beyond basic
sanctions and PEP list screening.
Arachnys uses a library of 23,000+ sources of risk-relevant
information, searching in 97 languages and in 120+ countries. It
is the industry leader in emerging markets.

Finding the Bad Actor
• A two-part project:
1. Name matching in free text - an Elasticsearch plugin to
find names & adverse terms
2. Replicating Relevancy - replicate with Elasticsearch how
a third-party service scores name matches
• Complicating factors:
1. Scores need to be “comparable” between searches
2. The client had to understand how this all worked
...so we're going to try and explain it to you!
HW

But first...a short demonstration....
CH

Name matching in free text
• An example:
– Query for: “William Gates” near “money launder”
– Match: “Bill Henry Gates accused of money laundering”
– Do not match: “Bills gates linked to money launder accused”

• An example (2)
– Query string approximation:
“William Gates”~2 AND “money launder”
– Problems:
■ We want proximity between the words in “William
Gates” and between the phrases “William
Gates” and “money launder”; AND only requires both
phrases to appear somewhere in the document
■ We want fine-grained control of scores
■ Lots of analyzer subtleties: plurals, stemming, middle
names, typos
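As a rough illustration, the query string approximation could be sent to Elasticsearch as a query_string request body. This is only a sketch: the helper name and the `content` field are assumptions, not part of the original project.

```python
import json


def query_string_approximation(name, adverse_term, name_slop=2, field="content"):
    """Build a query_string request body.

    The field name "content" is hypothetical; use whatever text field
    the index actually defines.
    """
    # Phrase with slop for the name, plain phrase for the adverse term.
    query = f'"{name}"~{name_slop} AND "{adverse_term}"'
    return {"query": {"query_string": {"default_field": field, "query": query}}}


body = query_string_approximation("William Gates", "money launder")
print(json.dumps(body, indent=2))
```

The slop only applies inside the name phrase, so nothing here constrains how far the name is from the adverse term.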

• The solution - DPeso algorithm
– SpanQuery to find a name
– SpanQuery to find one of a set of adverse terms
– Score documents by:
• Frequency & position of matches of both SpanQueries
• Minimum distance between a match for each
SpanQuery
• Name exactness: Synonyms, name permutations,
middle names
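The "minimum distance" part of the scoring can be sketched as follows. This is not the DPeso implementation: the span representation, the linear fall-off and the `max_distance` parameter are all assumptions made for illustration.

```python
def min_span_distance(name_spans, term_spans):
    """Smallest gap, in token positions, between any name span and any term span.

    Spans are (start, end) token positions with end exclusive.
    Adjacent or overlapping spans have distance 0.
    Returns None if either list is empty.
    """
    best = None
    for n_start, n_end in name_spans:
        for t_start, t_end in term_spans:
            gap = max(t_start - n_end, n_start - t_end, 0)
            if best is None or gap < best:
                best = gap
    return best


def proximity_score(name_spans, term_spans, max_distance=10):
    """Hypothetical score in [0, 1]: 1.0 when adjacent, falling linearly to 0."""
    distance = min_span_distance(name_spans, term_spans)
    if distance is None or distance > max_distance:
        return 0.0
    return 1.0 - distance / max_distance


# "bill henry gates accused of money laundering":
# name span covers tokens 0-2, adverse term span covers tokens 5-6.
print(min_span_distance([(0, 3)], [(5, 7)]))
print(proximity_score([(0, 3)], [(5, 7)]))
```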

Hang on, what's a SpanQuery?
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

SpanNearQuery spanNear =
    new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term(FIELD, "money")),
        new SpanTermQuery(new Term(FIELD, "laundering"))},
    4,      // slop distance
    true);  // true if order important
matches “money laundering” but not “laundering business
makes money”
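To make slop and ordering concrete, here is a toy matcher over a list of tokens. It is a simplified model, not Lucene's implementation: it assumes single-token terms and counts slop as the number of unmatched positions inside the span.

```python
from itertools import product


def span_near(tokens, terms, slop, in_order):
    """Return True if every term occurs within a span allowing `slop` gaps.

    Simplified model of a span-near match: one position is chosen per term,
    and the number of non-matching positions inside the span must not
    exceed `slop`. With in_order=True the terms must appear in query order.
    """
    positions = [[i for i, tok in enumerate(tokens) if tok == term] for term in terms]
    for combo in product(*positions):
        if len(set(combo)) != len(combo):
            continue  # the same position cannot satisfy two terms
        if in_order and list(combo) != sorted(combo):
            continue
        gaps = max(combo) - min(combo) + 1 - len(combo)
        if gaps <= slop:
            return True
    return False


terms = ["money", "laundering"]
print(span_near("money laundering".split(), terms, 4, True))
print(span_near("laundering business makes money".split(), terms, 4, True))
```

With `in_order` set to False the second sentence would match, since only two tokens separate the terms.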

In Elasticsearch:
{
  "query": {
    "span_near": {
      "clauses": [
        { "span_term": { "field": "money" } },
        { "span_term": { "field": "laundering" } }
      ],
      "slop": 4,
      "in_order": true
    }
  }
}
– Note there are other types of SpanQueries, and you can
combine them in various ways as you might expect
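One way of combining span queries, sketched as Python dictionaries that serialise to the Elasticsearch query DSL: a span_near for the name, placed near a span_or of adverse terms. The field name `content`, the term "fraud" and the slop values are illustrative assumptions.

```python
import json

FIELD = "content"  # hypothetical field name


def span_term(field, value):
    return {"span_term": {field: value}}


def span_near(clauses, slop, in_order):
    return {"span_near": {"clauses": clauses, "slop": slop, "in_order": in_order}}


def span_or(clauses):
    return {"span_or": {"clauses": clauses}}


# The name: "william" followed by "gates", allowing room for a middle name.
name = span_near([span_term(FIELD, "william"), span_term(FIELD, "gates")],
                 slop=2, in_order=True)

# Any one of a set of adverse terms ("fraud" is an invented example).
adverse = span_or([
    span_near([span_term(FIELD, "money"), span_term(FIELD, "laundering")],
              slop=0, in_order=True),
    span_term(FIELD, "fraud"),
])

# The name within 10 positions of an adverse term, in either order.
body = {"query": span_near([name, adverse], slop=10, in_order=False)}
print(json.dumps(body, indent=2))
```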

• But...we needed to extend SpanQueries
– Standard Lucene SpanQueries score using basic
frequency
– But we care where a match appears
– We also care whether it's the actual word or a synonym
that matches
• So we extended SpanQuery scoring
– Earlier hits count for more
– Synonym matches count for less
• Implemented twice, for Lucene 4 and Lucene 6, due to Span
API changes (Alan’s fault)
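The two adjustments can be sketched as a per-match weight. The formula below is an assumption chosen for illustration (a linear position weight and a fixed synonym factor of 0.8), not the scoring used in the plugin.

```python
def span_match_score(position, is_synonym, doc_length, synonym_boost=0.8, boost=1.0):
    """Hypothetical per-match score.

    Earlier hits count for more: the weight falls linearly from 1.0 at the
    start of the document towards 0 at the end.
    Synonym matches count for less: they are scaled by synonym_boost (< 1).
    """
    if doc_length <= 0:
        return 0.0
    position_weight = 1.0 - position / doc_length
    exactness = synonym_boost if is_synonym else 1.0
    return boost * position_weight * exactness


def document_score(matches, doc_length):
    """Sum the scores of all (position, is_synonym) matches in a document."""
    return sum(span_match_score(pos, syn, doc_length) for pos, syn in matches)


# An exact match at the start and a synonym match halfway through.
print(document_score([(0, False), (50, True)], doc_length=100))
```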

• Created a new Elasticsearch query
• entity_search query combines two (new, improved)
SpanQueries
• Scoring is additive – the more adverse terms found near the
name, the higher the score

• entity_search query
{
  "query": {
    "entity_search": {
      "name": "Mr Benn",
      "adverse_terms": [ "shopkeeper", "hat", "costume shop" ],
      "fields",
      "analyzers",
      "fuzziness",
      "slop",          << slop distance
      "boost",         << default is 1
      "synonym_boost", << default is less than 1, to score synonyms lower
      "order_bias"     << if 0, only the given order matches; otherwise use it to score out-of-order matches lower
    }
  }
}
HW

● Why does it need to be explainable?
○ Finance organisations need to explain to regulators why
they may or may not have considered a particular hit
○ Need to be able to limit the data volume across all
searches in a way that is defensible
● Why doesn’t standard scoring cut it?
○ Dependent on field stats of entire dataset, not exactly
transparent or obvious
○ Does not take into account “risk”, i.e. severity of
particular term, only frequency
○ Does not allow setting a “global” threshold
CH

Replicating Relevancy
HW

● Arachnys' client required a high volume monitoring solution
○ Based on a third party data set and name matching logic
○ Name matches scored from 0-100%
○ Supports:
■ Synonyms (William, Bill, Wilhelm etc.)
■ Typos
■ Initial matches
■ Missing spaces, amongst other edge cases
● Arachnys wanted to implement the third party's relevancy
calculations in Elasticsearch
○ People said it couldn't be done
○ People said it shouldn’t be done

• Custom indexing
• Custom query parser
• A proxy for Elasticsearch
• Searchkit GUI www.searchkit.co

• Custom indexing
– Source data is in Accuity format XML
– Indexer written in Python
• Normalises dates, addresses, IDs
• Sorts out mappings
• Pushes data to Elasticsearch
– Extensive multi-field usage to index:
• Initials
• Stripped space
• Word counts
• Phonetics
• Synonyms
• ICU normalization
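A few of these per-name values can be derived with plain Python, as in the sketch below. The output keys are hypothetical field names, and NFKC plus casefolding is only a stand-in for ICU normalization; phonetic and synonym variants are left to Elasticsearch analyzers and are not shown.

```python
import unicodedata


def name_variants(name):
    """Derive some of the values that could feed the multi-fields for one name."""
    # Stand-in for ICU normalization: Unicode NFKC plus case folding.
    normalised = unicodedata.normalize("NFKC", name).casefold()
    words = normalised.split()
    return {
        "name": normalised,
        "initials": "".join(word[0] for word in words),
        "stripped_space": "".join(words),
        "word_count": len(words),
    }


print(name_variants("David John Stevens"))
```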

• Custom query parser written in Python:
– Split the query into terms and search for them individually
– Also do a phrase search
– Features such as 'ignore initials'
– Scoring:
• If terms are on the 'uninteresting' list, lower the score
– e.g. “PLC”, “ltd”
• Normalise to the range 0 to 1 using a function query
• Optional minimum score can be passed in
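A minimal sketch of such a parser, assuming whitespace splitting, a hypothetical `name` field and an arbitrary low boost of 0.2 for uninteresting terms. The normalisation to the 0 to 1 range is omitted here, and the real parser's logic is not shown in the slides.

```python
UNINTERESTING = {"plc", "ltd"}  # the two examples given; a real list would be longer


def parse_name_query(text, field="name", low_boost=0.2, min_score=None):
    """Build an Elasticsearch bool query from a free-text name.

    Each term is searched individually, with a lower boost for
    'uninteresting' terms, and the whole input is also searched as a phrase.
    """
    terms = text.lower().split()
    should = [
        {"match": {field: {"query": term,
                           "boost": low_boost if term in UNINTERESTING else 1.0}}}
        for term in terms
    ]
    should.append({"match_phrase": {field: {"query": " ".join(terms)}}})
    body = {"query": {"bool": {"should": should}}}
    if min_score is not None:
        body["min_score"] = min_score  # optional minimum score cut-off
    return body


body = parse_name_query("Arachnys Ltd", min_score=0.5)
print(body)
```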

• A proxy for Elasticsearch
– Uses Flask web service framework
– Endpoints:
• /healthcheck << run an arbitrary query; if it returns results then OK
• /<indexname>/_debug << show what the parser returns
• /<indexname>/_search << parse the query and do a search
• /<indexname>/_explain << do a search with 'explain' on
• /<indexname>/<doc_id> << return a document

• A wrinkle – Korean name searching
– Korean names can be written as Hangul
– Hangul doesn't use whitespace
– This breaks the ICU tokeniser
• Solution:
– Check input string for Hangul
– Use n-grams
• Works better than the system we're replicating!
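The Hangul check and n-gram fallback might look like the sketch below. The Unicode ranges are the standard Hangul blocks; the bigram size and the function names are assumptions, and the sample name is a generic placeholder.

```python
def contains_hangul(text):
    """True if any character falls in a Hangul Unicode block."""
    return any(
        "\uac00" <= ch <= "\ud7a3"      # Hangul syllables
        or "\u1100" <= ch <= "\u11ff"   # Hangul Jamo
        or "\u3130" <= ch <= "\u318f"   # Hangul compatibility Jamo
        for ch in text
    )


def char_ngrams(text, n=2):
    """Character n-grams of the text with whitespace removed."""
    chars = "".join(text.split())
    if len(chars) <= n:
        return [chars] if chars else []
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]


def tokenise_name(name):
    """Use n-grams for Hangul input, whitespace tokens otherwise."""
    return char_ngrams(name) if contains_hangul(name) else name.lower().split()


print(tokenise_name("홍길동"))
print(tokenise_name("David Stevens"))
```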
HW

• Examples
○ STEVENS, DAVID
■ David Stevens should be 100%
■ David John Stevens should be in the 90s
■ Dave John Stevens should be less than that due
to synonyms
■ David Johnston shouldn’t match

• Examples (2)
○ Arachnys Entertainment & Resorts Co
■ Arachnys Entertainment & Resorts Co Ltd
should be 100%
■ Arachnys Entertainment & Resorts should be
very close to 100%
■ Arachnys should be over 50% due to brand
recognition
■ Arachnys Information Systems below 50%
due to mismatching terms

• Tested using Quepid www.quepid.com

The future...
● Name matching in free text:
○ Entity extraction (or basic capitalized phrase extraction)
to better support more advanced name matching: it’s
difficult to do anything nicer without incurring a lot of
noise
● Replicating relevancy
○ Ensuring Elasticsearch handles analysis; client-side
analysis is a minefield
○ Generalizing to a self-contained name analyzer
CH

Thank you!
Any questions?
charlie@flax.co.uk
www.flax.co.uk/blog
+44 (0) 8700 118334
Twitter: @FlaxSearch

Arachnys is hiring for:
● UX Designer
● Front-end Developer
● Senior Software Engineer
Apply at https://arachnys.workable.com/
@arachnys
AS
