Semantic & Multilingual Strategies in Lucene/Solr

Trey Grainger
Director of Engineering, Search & Analytics @CareerBuilder

Outline
•Introduction
•Text Analysis Refresher
•Language-specific Text Analysis
•Multilingual Search Strategies
•Automatic Language Identification
•Semantic Search Strategies (understanding “meaning”)
•Conclusion

About Me
Trey Grainger
Director of Engineering, Search & Analytics
Joined CareerBuilder in 2007 as Software Engineer
MBA, Management of Technology – GA Tech
BA, Computer Science, Business, & Philosophy – Furman University
Mining Massive Datasets (in progress) – Stanford University
Fun outside of CB:
•Co-author of Solr in Action, plus several research papers
•Frequent conference speaker
•Founder of Celiaccess.com, the gluten-free search engine
•Lucene/Solr contributor

At CareerBuilder, Solr Powers...

Text Analysis Refresher
A text field in Lucene/Solr has an Analyzer containing:
①Zero or more CharFilters
Takes incoming text and “cleans it up” before it is tokenized
②One Tokenizer
Splits incoming text into a Token Stream containing Zero or more Tokens
③Zero or more TokenFilters
Examines and optionally modifies each Token in the Token Stream
*From Solr in Action, Chapter 6
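The three stages above can be sketched as a small pipeline. This is a toy Python model, not Lucene code: the function names, the regex-based HTML stripping, and the three-word stopword list are all assumptions made for illustration.

```python
import html
import re

def html_strip_char_filter(text):
    """CharFilter stage: clean up raw text (strip tags, decode entities) before tokenizing."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))

def whitespace_tokenizer(text):
    """Tokenizer stage: split the cleaned text into a stream of tokens."""
    return text.split()

def lowercase_filter(tokens):
    """TokenFilter stage: modify each token."""
    return [t.lower() for t in tokens]

def stop_filter(tokens, stopwords=frozenset({"the", "a", "of"})):
    """TokenFilter stage: remove tokens (hypothetical stopword list)."""
    return [t for t in tokens if t not in stopwords]

def analyze(text, char_filters, tokenizer, token_filters):
    """Zero or more char filters, exactly one tokenizer, zero or more token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

tokens = analyze(
    "<b>The</b> Adventures of Tom Sawyer",
    [html_strip_char_filter],
    whitespace_tokenizer,
    [lowercase_filter, stop_filter],
)
print(tokens)  # ['adventures', 'tom', 'sawyer']
```

The order of the token filters matters: here the stopword list is lowercase, so lowercasing has to run first.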

Language-specific Text Analysis

Example English Analysis Chains
<fieldType name="text_en" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            words="lang/stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="lang/en_protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="lang/en_synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Per-language Analysis Chains
*Some of the 32 different language configurations in Appendix B of Solr in Action

Which Stemmer do I choose?

Common English Stemmers

When Stemming goes awry
Fixing Stemming Mistakes:
•Unfortunately, every stemmer will have problem-cases that aren’t handled as you would expect
•Thankfully, stemmers can be overridden
•KeywordMarkerFilter: protects a list of terms you specify from being stemmed
•StemmerOverrideFilter: applies a list of custom term mappings you specify
Alternate strategy:
•Use Lemmatization (root-form analysis) instead of Stemming
•Commercial vendors help tremendously in this space (see http://www.basistech.com/case-study-career-builder/)
•The Hunspell stemmer enables dictionary-based support of varying quality in over 100 languages
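To make the two override mechanisms concrete, here is a toy Python sketch. The suffix-stripping stemmer, the protected list, and the override table are all hypothetical stand-ins; only the idea (check overrides and protected terms before stemming) mirrors what KeywordMarkerFilter and StemmerOverrideFilter do in a Solr analysis chain.

```python
def naive_stem(token):
    """Hypothetical toy stemmer: strips a few common English suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def stem_with_overrides(tokens, protected=frozenset(), overrides=None):
    """Stem tokens, honoring custom mappings and a protected-terms list."""
    overrides = overrides or {}
    stemmed = []
    for token in tokens:
        if token in overrides:      # StemmerOverrideFilter-style custom mapping
            stemmed.append(overrides[token])
        elif token in protected:    # KeywordMarkerFilter-style protection
            stemmed.append(token)
        else:
            stemmed.append(naive_stem(token))
    return stemmed

tokens = ["running", "news", "mice", "jobs"]
plain = stem_with_overrides(tokens)
fixed = stem_with_overrides(tokens, protected={"news"}, overrides={"mice": "mouse"})
print(plain)  # ['runn', 'new', 'mice', 'job']
print(fixed)  # ['runn', 'news', 'mouse', 'job']
```

Without the overrides, the toy stemmer turns "news" into "new" and leaves "mice" untouched, which is the kind of problem case the two filters exist to patch.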

Stemming vs. Lemmatization
•Stemming: algorithmic manipulation of text, based upon common per-language rules
•Lemmatization: finds the dictionary form of a term (lemma means “root”)
-dramatically improves precision (only matching terms that “should” match), while not significantly impacting recall (all terms that should match do match).

Multilingual Search Strategies

Multilingual Search Strategies
How do you handle:
…a different language per document?
…multiple languages in the same document?
…multiple languages in the same field?
Strategies:
1)Separate field per language
2)Separate collection/core per language
3)All languages in one field

Strategy 1: Separate field per language

Separate field per language
schema.xml
<field name="id" type="string" indexed="true" stored="true" />
<field name="title" type="string" indexed="true" stored="true" />
<field name="content_english" type="text_english" indexed="true" stored="true" />
<field name="content_french" type="text_french" indexed="true" stored="true" />
<field name="content_spanish" type="text_spanish" indexed="true" stored="true" />
<fieldType name="text_english" class="solr.TextField">
  <analyzer>
    <!-- tokenizer not shown on the slide; StandardTokenizerFactory is assumed -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_en.txt"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.KStemFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="text_spanish" class="solr.TextField">
  <analyzer>
    <!-- tokenizer assumed, as above -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_es.txt" format="snowball"/>
    <filter class="solr.SpanishLightStemFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="text_french" class="solr.TextField">
  <analyzer>
    <!-- tokenizer assumed, as above -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ElisionFilterFactory" ignoreCase="true"
            articles="lang/contractions_fr.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_fr.txt" format="snowball"/>
    <filter class="solr.FrenchLightStemFilterFactory"/>
  </analyzer>
</fieldType>

Separate field per language: one language per document
<doc>
<field name="id">1</field>
<field name="title">The Adventures of Huckleberry Finn</field>
<field name="content_english">YOU don't know about me without you have read
a book by the name of The Adventures of Tom Sawyer; but that ain't no
matter. That book was made by Mr. Mark Twain, and he told the truth,
mainly. There was things which he stretched, but mainly he told the truth.
</field>
</doc>
<doc>
<field name="id">2</field>
<field name="title">Les Misérables</field>
<field name="content_french">Nul n'aurait pu le dire; tout ce qu'on savait,
c'est que, lorsqu'il revint d'Italie, il était prêtre.
</field>
</doc>
<doc>
<field name="title">Don Quixote</field>
<field name="content_spanish">Demasiada cordura puede ser la peor de las
locuras, ver la vida como es y no como debería de ser.
</field>
</doc>
Query:
http://localhost:8983/solr/field-per-language/select?
fl=title&
defType=edismax&
qf=content_english content_french content_spanish&
q="he told the truth" OR "il était prêtre" OR "ver la vida como es"
Response:
{
"response":{"numFound":3,"start":0,"docs":[
{
"title":["The Adventures of Huckleberry Finn"]},
{
"title":["Don Quixote"]},
{
"title":["Les Misérables"]}]
}}
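The query above is shown unencoded for readability. A client would normally URL-encode it; the following Python sketch builds the same request with the standard library. The host, port, and core name are taken from the slide, and nothing is actually sent to a server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "http://localhost:8983/solr/field-per-language/select"

params = {
    "fl": "title",
    "defType": "edismax",
    # edismax searches every field listed in qf, one per language here
    "qf": "content_english content_french content_spanish",
    "q": '"he told the truth" OR "il était prêtre" OR "ver la vida como es"',
}

url = BASE + "?" + urlencode(params)
print(url)

# Decoding the query string again recovers the original parameters.
decoded = {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}
print(decoded == params)  # True
```

Because each phrase is analyzed by every field in qf, the French phrase is also run through the English and Spanish chains; it simply fails to match there.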

Separate field per language: multiple languages per document
Query 1:
http://localhost:8983/solr/field-per-language/select?
fl=title&
defType=edismax&
qf=content_english content_french content_spanish&
q="wisdom"
Query 2:
http://localhost:8983/solr/field-per-language/select?...
q="sabiduría"
Query 3:
http://localhost:8983/solr/field-per-language/select?...
q="sagesse"
Response: (same for queries 1–3)
{
"response":{"numFound":1,"start":0,"docs":[
{
"title":["Proverbs"]}]
}}
Documents:
<doc>
<field name="title">Proverbs</field>
<field name="content_spanish">No la abandones y ella velará sobre
ti, ámala y ella te protegerá. Lo principal es la sabiduría; adquiere
sabiduría, y con todo lo que obtengas adquiere inteligencia.
</field>
<field name="content_english">Do not forsake wisdom, and she will protect you;
love her, and she will watch over you. Wisdom is supreme;
therefore get wisdom. Though it cost all you have, get understanding.
</field>
<field name="content_french">N'abandonne pas la sagesse, et elle te
gardera, aime-la, et elle te protégera. Voici le début de la sagesse:
acquiers la sagesse, procure-toi le discernement au prix de tout ce que tu possèdes.
</field>
</doc>

Summary: Separate field per language

Strategy 2: Separate collection per language

Separate collection per language: schema.xml

Separate collection per language: Indexing & Querying
Indexing:
cd $SOLR_IN_ACTION/example-docs/
java -Durl=http://localhost:8983/solr/english/update -jar post.jar ch14/documents/english.xml
java -Durl=http://localhost:8983/solr/spanish/update -jar post.jar ch14/documents/spanish.xml
java -Durl=http://localhost:8983/solr/french/update -jar post.jar ch14/documents/french.xml
Query (collections in SolrCloud):
http://localhost:8983/solr/aggregator/select?
shards=english,spanish,french&
df=content&
q=query in any language here
Query (specific cores):
http://localhost:8983/solr/aggregator/select?
shards=localhost:8983/solr/english,
localhost:8983/solr/spanish,
localhost:8983/solr/french&
df=content&
q=query in any language here
Documents:
All documents just have a single “content” field. The documents get routed to a different language-specific Solr collection based upon the language of the content field.
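The routing and the aggregated query can be sketched in a few lines of Python. The core names and host come from the slide; the language-to-core table, the helper names, and the fallback to the English core for unknown languages are assumptions for illustration, and no request is sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SOLR = "http://localhost:8983/solr"
# Hypothetical mapping from a document's language code to its core.
CORES = {"en": "english", "es": "spanish", "fr": "french"}

def update_url(lang, default_core="english"):
    """Pick the update endpoint for a document based on its language."""
    return f"{SOLR}/{CORES.get(lang, default_core)}/update"

def aggregator_query_url(q):
    """Build one distributed query that fans out to every language core."""
    shards = ",".join(f"localhost:8983/solr/{core}" for core in CORES.values())
    return f"{SOLR}/aggregator/select?" + urlencode(
        {"shards": shards, "df": "content", "q": q}
    )

print(update_url("fr"))  # http://localhost:8983/solr/french/update
query_url = aggregator_query_url("sagesse")
print(parse_qs(urlsplit(query_url).query)["shards"][0])
```

Every core uses the same field name ("content"), so the query needs no per-language field list; each core applies its own analysis chain to the same query text.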

Summary: Separate index per language

Strategy 3: One Field for all languages

One Field for all languages: Feature Status
•Note: This feature is not yet committed to Solr
•I’m working on it in my free time. Currently it supports:
•Update Request Processor which can automatically detect the languages of documents and choose the correct analyzers
•Field Type which allows dynamically choosing one or more analyzers on a per-field (indexing) and per-term (querying) basis.
•Current code from Solr in Action is freely available on GitHub.
•There is a JIRA ticket open to ultimately contribute this back to Solr: SOLR-6492
•Some work is still necessary to make querying more user friendly.

One Field for all languages
Step 1: Define Multilingual Field
schema.xml:
<fieldType name="multilingual_text" class="sia.ch14.MultiTextField"
    sortMissingLast="true" defaultFieldType="text_general"
    fieldMappings="en:text_english,
                   es:text_spanish,
                   fr:text_french,
                   de:text_german"/> [1]
<field name="text" type="multilingual_text" indexed="true" multiValued="true" />
Step 2: Index documents
<add><doc>…
<field name="text">general keywords</field> [2]
<field name="text">en,es|the school, las escuelas</field>…
</doc></add>
<add><doc>…
<field name="text">en|the school</field>
<field name="text">es|las escuelas</field>…
</doc></add>
Step 3: Search
http://localhost:8983/solr/collection1/select?
q=es|escuela OR en,es,de|school OR school [2]
[1] Note that "text_english", "text_spanish", "text_french", and "text_german" refer to field types defined elsewhere in the schema.xml
[2] Uses the "defaultFieldType", in this case "text_general", defined elsewhere in schema.xml

One Field For All Languages: Stacked Token Streams
1) English Field
2) Spanish Field
3) English + Spanish combined in Multilingual Text Field
multilingual_text
①For each language requested, the appropriate field type is chosen
②The input text is passed separately to the Analyzer chain for each field type
③The resulting Token Streams from each Analyzer chain are stacked into a unified Token Stream based upon their position increments
*Screenshot from Solr in Action, Chapter 14
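Step ③ can be illustrated with a toy Python model of token stacking. This is not the MultiTextField implementation: streams are modeled as lists of (term, position increment) pairs, and the sample analyzer outputs for "las escuelas" are hypothetical.

```python
def to_positions(stream):
    """Convert (term, position_increment) pairs into (absolute_position, term) pairs."""
    position, out = 0, []
    for term, increment in stream:
        position += increment
        out.append((position, term))
    return out

def stack_streams(*streams):
    """Merge several token streams into one, ordered by absolute position.

    Tokens that land on the same position get a position increment of 0,
    so they are stacked on top of each other like synonyms.
    """
    merged = sorted(
        (position, index, term)
        for index, stream in enumerate(streams)
        for position, term in to_positions(stream)
    )
    stacked, last_position = [], 0
    for position, _, term in merged:
        stacked.append((term, position - last_position))
        last_position = position
    return stacked

# Hypothetical analyzer outputs for the input "las escuelas":
english = [("las", 1), ("escuelas", 1)]  # English chain leaves both terms alone
spanish = [("escuela", 2)]               # Spanish chain drops the stopword, stems the noun

stacked = stack_streams(english, spanish)
print(stacked)  # [('las', 1), ('escuelas', 1), ('escuela', 0)]
```

The increment of 2 on the Spanish token records the gap left by the removed stopword, which is what lets it line up with "escuelas" from the English chain.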

Strategy 3: All languages in one field
*See Solr in Action, Chapter 14

Automatic Language Identification

Identifying languages in documents
solrconfig.xml
...
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <lst name="invariants">
      <str name="langid.fl">content,content_lang1,content_lang2,content_lang3</str>
      <str name="langid.langField">language</str>
      <str name="langid.langsField">languages</str>
      ...
    </lst>
  </processor>
  ...
</updateRequestProcessorChain>
…
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <!-- opening <lst> is missing on the slide; "invariants" is assumed -->
  <lst name="invariants">
    <str name="update.chain">langid</str>
  </lst>
</requestHandler>
...
schema.xml
...
<field name="language" type="string" indexed="true" stored="true" />
<field name="languages" type="string" indexed="true" stored="true" multiValued="true"/>
...

Identifying languages in documents
Sending documents:
cd $SOLR_IN_ACTION/example-docs/
java -Durl=http://localhost:8983/solr/langid/update -jar post.jar ch14/documents/langid.xml
Query
http://localhost:8983/solr/langid/select?
q=*:*&
fl=title,language,languages
Results
[{
"title":"The Adventures of Huckleberry Finn",
"language":"en",
"languages":["en"]},
{
"title":"Les Misérables",
"language":"fr",
"languages":["fr"]},
{
"title":"Don Quixote",
"language":"es",
"languages":["es"]},
{
"title":"Proverbs",
"language":"fr",
"languages":["fr", "en", "es"]}]

Mapping data to language-specific fields
solrconfig.xml
...
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <lst name="invariants">
      <str name="langid.fl">content</str>
      <str name="langid.langField">language</str>
      <str name="langid.map">true</str>
      <str name="langid.map.fl">content</str>
      <str name="langid.whitelist">en,es,fr</str>
      <str name="langid.map.lcmap">en:english es:spanish fr:french</str>
      <str name="langid.fallback">en</str>
    </lst>
  </processor>
...
</updateRequestProcessorChain>
...
Indexed Documents:
[{
"title":"The Adventures of Huckleberry Finn",
"language":"en",
"content_english":[ "YOU don't know about me without..."]},
{
"title":"Les Misérables",
"language":"fr",
"content_french":[ "Nul n'aurait pu le dire; tout ce..."]},
{
"title":"Don Quixote",
"language":"es",
"content_spanish":[ "Demasiada cordura puede ser la peor..."]}]
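The effect of the mapping parameters can be mimicked in a few lines of Python. This is a sketch of the behavior visible in the indexed documents above, not the processor's code: the function name is hypothetical, the detected language is passed in rather than computed, and the constants simply mirror langid.whitelist, langid.map.lcmap, and langid.fallback.

```python
WHITELIST = ("en", "es", "fr")                              # langid.whitelist
LCMAP = {"en": "english", "es": "spanish", "fr": "french"}  # langid.map.lcmap
FALLBACK = "en"                                             # langid.fallback

def map_language_fields(doc, detected_lang, source_field="content",
                        lang_field="language"):
    """Record the language and move `content` to `content_<mapped language>`."""
    lang = detected_lang if detected_lang in WHITELIST else FALLBACK
    mapped = dict(doc)  # leave the caller's document untouched
    mapped[lang_field] = lang
    if source_field in mapped:
        target = f"{source_field}_{LCMAP.get(lang, lang)}"
        mapped[target] = [mapped.pop(source_field)]
    return mapped

doc = {"title": "Les Misérables", "content": "Nul n'aurait pu le dire; tout ce..."}
print(map_language_fields(doc, "fr"))
# {'title': 'Les Misérables', 'language': 'fr',
#  'content_french': ["Nul n'aurait pu le dire; tout ce..."]}
```

A language outside the whitelist (say "de") falls back to "en", so its text would land in content_english.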

The need for Semantic Search
User’s Query: machine learning research and development Portland, OR software engineer AND hadoop java
Traditional Query Parsing: (machine AND learning AND research AND development AND portland) OR (software AND engineer AND hadoop AND java)
Semantic Query Parsing: "machine learning" AND "research and development" AND "Portland, OR" AND "software engineer" AND hadoop AND java
Semantically Expanded Query: ("machine learning"^10 OR "data scientist" OR "data mining" OR "computer vision") AND ("research and development"^10 OR "r&d") AND ("Portland, OR"^10 OR "Portland, Oregon" OR {!geofilt pt=45.512,-122.676 d=50 sfield=geo}) AND ("software engineer"^10 OR "software developer") AND (hadoop^10 OR "big data" OR hbase OR hive) AND (java^10 OR j2ee)
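The step from the semantically parsed query to the expanded one is mechanical once a table of related terms exists. A minimal Python sketch follows; the helper names are hypothetical, the related-terms table is copied from the slide's example, and the geo filter clause is left out for brevity.

```python
def quote(term):
    """Wrap multi-word terms in quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term or "," in term else term

def expand_clause(phrase, related, boost=10):
    """Boost the user's own phrase and OR it with its related terms."""
    alternatives = [f"{quote(phrase)}^{boost}"] + [quote(r) for r in related]
    return "(" + " OR ".join(alternatives) + ")"

def expand_query(phrases, related_terms):
    """AND together one expanded clause per extracted phrase."""
    return " AND ".join(
        expand_clause(phrase, related_terms.get(phrase, [])) for phrase in phrases
    )

related_terms = {
    "machine learning": ["data scientist", "data mining", "computer vision"],
    "hadoop": ["big data", "hbase", "hive"],
    "java": ["j2ee"],
}

expanded = expand_query(["machine learning", "hadoop", "java"], related_terms)
print(expanded)
# ("machine learning"^10 OR "data scientist" OR "data mining" OR "computer vision")
#   AND (hadoop^10 OR "big data" OR hbase OR hive) AND (java^10 OR j2ee)
```

The boost on the original phrase keeps documents that match the user's own wording ranked above documents that match only a related term.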

Semantic Search Architecture – Query Parsing
1)Generate Model of Domain-specific phrases
•Can mine query logs or actual text of documents for significant phrases within your domain [1]
2)Feed known phrases to SolrTextTagger [2] (uses Lucene FST for high-throughput term lookups)
3)Use SolrTextTagger to perform entity extraction on incoming queries (tagging documents is also optional)
4)Shown on next slide: Pass extracted entities to a Query Augmentation phase to rewrite query with enhanced semantic understanding (synonyms, related keywords, related categories, etc.)
[1] K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.
[2] https://github.com/OpenSextant/SolrTextTagger
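Steps 2 and 3 amount to finding known phrases inside an incoming query. The Python sketch below does this with a greedy longest-match scan over a plain set. It is a stand-in for the idea only: SolrTextTagger uses a Lucene FST rather than a set, and the function name, the phrase list, and the four-word window are assumptions.

```python
def tag_phrases(query, known_phrases, max_len=4):
    """Greedy longest-match tagging: returns (text, is_known_phrase) pairs."""
    tokens = query.lower().split()
    known = {phrase.lower() for phrase in known_phrases}
    tagged, i = [], 0
    while i < len(tokens):
        # Try the longest window first; a single token always matches as a fallback.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if n == 1 or candidate in known:
                tagged.append((candidate, candidate in known))
                i += n
                break
    return tagged

known_phrases = ["machine learning", "research and development", "software engineer"]
tags = tag_phrases(
    "machine learning research and development software engineer hadoop java",
    known_phrases,
)
print(tags)
# [('machine learning', True), ('research and development', True),
#  ('software engineer', True), ('hadoop', False), ('java', False)]
```

The tagged phrases are what the Query Augmentation phase would then look up for synonyms and related terms.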

Semantic Search Architecture – Query Augmentation
[Diagram: Semantic Query Augmentation. Labels recovered from the figure:]
Inputs to the model:
•Search Behavior, Application Behavior, etc.
•Job Title Classifier, Skills Extractor, Job Level Classifier, etc.
•Clustering relationships
Keywords: machine learning
Known keyword phrases:
java developer
machine learning
registered nurse
Related Phrases
machine learning: { data mining .9, matlab .8, data scientist .75, artificial intelligence .7, neural networks .55 }
Common Job Titles
machine learning: { software engineer .65, data manager .3, data scientist .25, hadoop engineer .2 }
Related Occupations
machine learning: {
15-1031.00 .58 Computer Software Engineers, Applications
15-1011.00 .55 Computer and Information Scientists, Research
15-1032.00 .52 Computer Software Engineers, Systems Software }
Modified Query:
keywords:((machine learning)^10 OR { AT_LEAST_2: ("data mining"^0.9, matlab^0.8, "data scientist"^0.75, "artificial intelligence"^0.7, "neural networks"^0.55) })
{ BOOST_TO_TOP: (job_title:("software engineer" OR "data manager" OR "data scientist" OR "hadoop engineer")) }

Differentiating related terms
Synonyms:
cpa => certified public accountant
rn => registered nurse
r.n. => registered nurse
Ambiguous Terms*:
driver => driver (trucking) ~80%
driver => driver (software) ~20%
Related Terms:
r.n. => nursing, bsn
hadoop => mapreduce, hive, pig
*differentiated based upon user and query context

Semantic Search “under the hood”

2014 Publications & Presentations
Books:
Solr in Action - A comprehensive guide to implementing scalable search using Apache Solr
Research papers:
●Towards a Job title Classification System
●Augmenting Recommendation Systems Using a Model of Semantically-related Terms Extracted from User Behavior
●sCooL: A system for academic institution name normalization
●Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon
●PGMHD: A Scalable Probabilistic Graphical Model for Massive Hierarchical Data Problems
●SKILL: A System for Skill Identification and Normalization
Speaking Engagements:
●WSDM 2014 Workshop: “Web-Scale Classification: Classifying Big Data from the Web”
●Atlanta Solr Meetup
●Atlanta Big Data Meetup
●The Second International Symposium on Big Data and Data Analytics
●Lucene/Solr Revolution 2014
●RecSys 2014
●IEEE Big Data Conference 2014

Conclusion
•Language analysis options for each language are very configurable
•There are multiple strategies for handling multilingual content based upon your use case
•When in doubt, automatic language detection can be easily leveraged in your indexing pipeline
•The next generation of query/relevancy improvements will be able to understand the intent of the user.

Contact Info
Yes, WE ARE HIRING @CareerBuilder. Come talk with me if you are interested…
Trey Grainger
trey.grainger@careerbuilder.com
@treygrainger
http://solrinaction.com
Conference discount (43% off): lusorevcftw
Other presentations: http://www.treygrainger.com
