Beyond TF-IDF
Stephen Murtagh
etsy.com
Beyond tf idf why, what & how
20,000,000 items
1,000,000 sellers
15,000,000 daily searches
80,000,000 daily calls to Solr
Etsy Engineering
• Code as Craft - our engineering blog
• http://codeascraft.etsy.com/
• Continuous Deployment
• https://github.com/etsy/deployinator
• Experiment-driven culture
• Hybrid engineering roles
• Dev-Ops
• Data-Driven Products
Etsy Search
• 2 search clusters: Flip and Flop
• Master -> 20 slaves
• Only one cluster takes traffic
• Thrift (no HTTP endpoint)
• BitTorrent for index replication
• Solr 4.1
• Incremental index every 12 minutes
Beyond TF-IDF
•Why?
•What?
•How?
Luggage tags
“unique bag”
q = unique+bag
Scoring in Lucene
• Some factors are fixed for any given query
• Some are constant
• Some are f(term, document): they come from user content
• IDF is f(term): it only measures rarity
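The formula these labels annotate did not survive in the transcript. For reference, Lucene's classic TF-IDF similarity (`TFIDFSimilarity`) documents its practical scoring function roughly as below. Which label the slides attached to which factor is an assumption here, not something the transcript confirms.

```latex
\mathrm{score}(q, d) = \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \cdot
  \sum_{t \in q} \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(t, d)
```

Under that reading, tf and norm are the f(term, document) factors and idf is the only f(term) factor, which is why it is the natural piece to swap out.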
IDF(“unique”) > IDF(“bag”)
4.429547 > 4.32836
q = unique+bag
score(“unique unique bag”) > score(“unique bag bag”)
“unique” tells us nothing...
Stop words
• Add “unique” to stop word list?
• What about “handmade” or “blue”?
• Low-information words can still be useful for matching
• ... but harmful for ranking
Why not replace IDF?
Beyond TF-IDF
•Why?
• IDF ignores term “usefulness”
•What?
•How?
What do we replace it with?
Benefits of IDF
$$
I_1 = \begin{array}{l|ccccc}
 & \text{doc}_1 & \text{doc}_2 & \text{doc}_3 & \cdots & \text{doc}_n \\ \hline
\text{art} & 2 & 0 & 1 & \cdots & 1 \\
\text{jewelry} & 1 & 3 & 0 & \cdots & 0 \\
\vdots & & & & & \\
\text{term}_m & 1 & 0 & 1 & \cdots & 0
\end{array}
$$
$$
IDF(\text{jewelry}) = 1 + \log\left(\frac{n}{\sum_d i_{d,\text{jewelry}}}\right)
$$
Sharding
$$
I_1 = \begin{array}{l|ccccc}
 & \text{doc}_1 & \text{doc}_2 & \text{doc}_3 & \cdots & \text{doc}_k \\ \hline
\text{art} & 2 & 0 & 1 & \cdots & 1 \\
\text{jewelry} & 1 & 3 & 0 & \cdots & 0 \\
\vdots & & & & & \\
\text{term}_m & 1 & 0 & 1 & \cdots & 0
\end{array}
$$
$$
I_2 = \begin{array}{l|ccccc}
 & \text{doc}_{k+1} & \text{doc}_{k+2} & \text{doc}_{k+3} & \cdots & \text{doc}_n \\ \hline
\text{art} & 6 & 1 & 0 & \cdots & 1 \\
\text{jewelry} & 0 & 1 & 3 & \cdots & 0 \\
\vdots & & & & & \\
\text{term}_m & 0 & 1 & 1 & \cdots & 0
\end{array}
$$
$$
IDF(\text{jewelry}) = 1 + \log\left(\frac{n}{\sum_d i_{d,\text{jewelry}}}\right)
$$
$$
IDF_1(\text{jewelry}) \neq IDF_2(\text{jewelry}) \neq IDF(\text{jewelry})
$$
Sharded IDF options
• Ignore it - Shards score differently
• Shards exchange stats - Messy
• Central source distributes IDF to shards
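A minimal Python sketch of why the first option makes shards score differently. The two shards and their documents are hypothetical toy data; only the IDF form, 1 + log(n / document frequency), follows the slides.

```python
import math

def idf(doc_freq: int, num_docs: int) -> float:
    """IDF in the form used on the slides: 1 + log(n / document frequency)."""
    return 1.0 + math.log(num_docs / doc_freq)

def doc_freq(docs, term: str) -> int:
    """Number of documents that contain the term at least once."""
    return sum(1 for doc in docs if term in doc)

# Hypothetical shards: each document is represented as a set of terms.
shard1 = [{"art", "jewelry"}, {"jewelry"}, {"art"}, {"jewelry", "bag"}]
shard2 = [{"art"}, {"art", "bag"}, {"jewelry"}, {"art", "bag"}]
whole_index = shard1 + shard2

idf_shard1 = idf(doc_freq(shard1, "jewelry"), len(shard1))
idf_shard2 = idf(doc_freq(shard2, "jewelry"), len(shard2))
idf_global = idf(doc_freq(whole_index, "jewelry"), len(whole_index))

print(idf_shard1, idf_shard2, idf_global)
```

Each shard sees a different document frequency for "jewelry", so the same term gets three different weights. A centrally computed value, as in the third option, avoids this.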
Information Gain
• P(x) - Probability of x appearing in a listing
• P(x|y) - Probability of x appearing given y appears
$$
\mathrm{info}(y) = D\left(P(X|y) \,\|\, P(X)\right)
$$
$$
\mathrm{info}(y) = \sum_{x \in X} \log\left(\frac{P(x|y)}{P(x)}\right) \cdot P(x|y)
$$
Terms with similar IDF
Term      Info(x)  IDF
unique    0.26     4.43
bag       1.24     4.33
pattern   1.20     4.38
original  0.85     4.38
dress     1.31     4.42
man       0.64     4.41
photo     0.74     4.37
stone     0.92     4.35

Terms with similar info gain
Term      Info(x)  IDF
unique    0.26     4.39
black     0.22     3.32
red       0.22     3.52
handmade  0.20     3.26
two       0.32     5.64
white     0.19     3.32
three     0.37     6.19
for       0.21     3.59
q = unique+bag
Using IDF
score(“unique unique bag”) > score(“unique bag bag”)
Using information gain
score(“unique unique bag”) < score(“unique bag bag”)
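The flip can be checked with a rough Python sketch. The term weights are the values quoted in the slides. The scoring function is a simplification that assumes Lucene's classic shape (sqrt(tf) times the squared term weight) and drops the query norm, coord, boosts and length norm, which are identical for these two three-word documents.

```python
import math

# Term weights quoted in the slides.
IDF = {"unique": 4.429547, "bag": 4.32836}
INFO_GAIN = {"unique": 0.26, "bag": 1.24}

def score(query_terms, doc_terms, weight):
    """Simplified score: sum over query terms of sqrt(tf) * weight^2.

    This is an illustrative stand-in, not Lucene's full scoring function.
    """
    total = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        total += math.sqrt(tf) * weight[term] ** 2
    return total

query = ["unique", "bag"]
doc_a = ["unique", "unique", "bag"]
doc_b = ["unique", "bag", "bag"]

for name, weight in (("IDF", IDF), ("info gain", INFO_GAIN)):
    print(name, score(query, doc_a, weight), score(query, doc_b, weight))
```

With IDF weights the document that repeats “unique” wins narrowly. With information gain weights the document that repeats “bag” wins clearly.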
Beyond TF-IDF
•Why?
• IDF ignores term “usefulness”
•What?
• Information gain accounts for term quality
•How?
Listing Quality
• Performance relative to rank
• Hadoop: logs -> HDFS
• cron: HDFS -> master
• bash: master -> slave
• Loaded as external file field
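Computing info gain
$$
I_1 = \begin{array}{l|ccccc}
 & \text{doc}_1 & \text{doc}_2 & \text{doc}_3 & \cdots & \text{doc}_n \\ \hline
\text{art} & 2 & 0 & 1 & \cdots & 1 \\
\text{jewelry} & 1 & 3 & 0 & \cdots & 0 \\
\vdots & & & & & \\
\text{term}_m & 1 & 0 & 1 & \cdots & 0
\end{array}
$$
$$
\mathrm{info}(y) = D\left(P(X|y) \,\|\, P(X)\right)
$$
$$
\mathrm{info}(y) = \sum_{x \in X} \log\left(\frac{P(x|y)}{P(x)}\right) \cdot P(x|y)
$$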
Hadoop
• Brute-force
• Count all terms
• Count all co-occurring terms
• Construct distributions
• Compute info gain for all terms
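The same brute-force steps can be sketched in memory in Python. This is not the Hadoop job from the talk. It assumes each term is counted once per listing and that P(X) and P(X|y) are obtained by normalizing those counts over terms. The listings are hypothetical toy data.

```python
import math
from collections import Counter, defaultdict

def info_gain(listings):
    """Return {term y: D(P(X|y) || P(X))} for every term in the listings."""
    term_counts = Counter()             # count all terms
    pair_counts = defaultdict(Counter)  # count all co-occurring terms
    for listing in listings:
        terms = set(listing)
        term_counts.update(terms)
        for y in terms:
            pair_counts[y].update(terms - {y})

    total = sum(term_counts.values())
    scores = {}
    for y, cooc in pair_counts.items():
        cooc_total = sum(cooc.values())
        gain = 0.0
        for x, count in cooc.items():
            # construct the distributions by normalizing the counts
            p_x_given_y = count / cooc_total
            p_x = term_counts[x] / total
            gain += p_x_given_y * math.log(p_x_given_y / p_x)
        scores[y] = gain
    return scores

# Hypothetical listings: "unique" appears across categories, "bag" does not.
listings = [
    ["unique", "bag", "leather"],
    ["bag", "leather"],
    ["unique", "ring", "silver"],
    ["ring", "silver"],
]
scores = info_gain(listings)
print(sorted(scores.items(), key=lambda item: item[1]))
```

In this toy corpus “unique” co-occurs evenly with everything, so its conditional distribution stays close to the background and it scores lower than “bag”.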
File Distribution
• cron copies score file to master
• master replicates file to slaves
# Pick the info gain file that sorts last by name
infogain=`find /search/data/ -maxdepth 1 -type f -name 'info_gain.*' -print | sort | tail -n 1`
# Copy it to the same path on the slave
scp "$infogain" "user@$slave:$infogain"
File Distribution
schema.xml
Beyond TF-IDF
•Why?
• IDF ignores term “usefulness”
•What?
• Information gain accounts for term quality
•How?
• Hadoop + similarity factory = win
Fast Deploys, Careful Testing
• Idea
• Proof of Concept
• Side-By-Side
• A/B test
• 100% Live
Side-by-Side
Relevant != High quality
A/B Test
• Users are randomly assigned to A or B
• A sees IDF-based results
• B sees info gain-based results
• Small but significant decrease in clicks, page views, etc.
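The talk does not say how users are assigned to A or B. One common approach, sketched here in Python as an assumption, is to hash a stable user identifier so that each user always lands in the same bucket. The function name and the experiment label are hypothetical.

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "info_gain_ranking") -> str:
    """Deterministically map a user to variant "A" or "B".

    Hashing the experiment name together with the user id keeps the
    assignment stable across requests and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    return "A" if digest[0] % 2 == 0 else "B"

variants = [assign_variant(str(user)) for user in range(1000)]
print(variants.count("A"), variants.count("B"))
```

Variant A would then be served IDF-based results and variant B info gain-based results.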
More homogeneous results
Lower average quality score
Next Steps
Parameter Tweaking...
Rebalance relevancy and quality signals in score
The Future
Latent Semantic Indexing in Solr/Lucene
Latent Semantic Indexing
• In TF-IDF, documents are sparse vectors in term space $\mathbb{R}^m_+$
• LSI re-maps these to dense vectors in “concept” space $\mathbb{R}^r$
• Construct transformation matrix: $T_{r \times m}$
• Load file at index and query time
• Re-map query and documents
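A small NumPy sketch of the re-mapping. It assumes T is taken as the transpose of the top-r left singular vectors of the term-document matrix. Other LSI variants also rescale by the singular values, and the talk does not say which was planned. The matrix is hypothetical toy data.

```python
import numpy as np

def build_lsi_transform(term_doc: np.ndarray, r: int) -> np.ndarray:
    """Return an r x m matrix T mapping term-space vectors to concept space.

    term_doc is the m x n term-document matrix (rows are terms).
    """
    u, _singular_values, _vt = np.linalg.svd(term_doc, full_matrices=False)
    return u[:, :r].T

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy data: rows are the terms bag, tote, ring, necklace.
term_doc = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

T = build_lsi_transform(term_doc, r=2)   # shape (r, m) = (2, 4)
concept_docs = T @ term_doc              # re-map documents
query = np.array([0.0, 1.0, 0.0, 0.0])   # the single-term query "tote"
concept_query = T @ query                # re-map the query

sim_to_bag_doc = cosine(concept_query, concept_docs[:, 0])
sim_to_ring_doc = cosine(concept_query, concept_docs[:, 2])
print(sim_to_bag_doc, sim_to_ring_doc)
```

In concept space the query “tote” lines up with the bag-heavy first document even though that document mentions “bag” more often than “tote”, while its similarity to the ring document stays near zero.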
CONTACT
Stephen Murtagh
smurtagh@etsy.com
