Search Logs
+ Machine Learning
= Automatic Tagging
John Berryman
Hi! I'm John Berryman
@JnBrymn
This is my name
with all the
unnecessary
letters removed.
● Degree in Aerospace Engineering
● Moved into Search Technology
● Wrote a book (... well 40% of one)
I got a haircut. Life
has been different
ever since.
That's me on
the cover.
● Discovery Engineer @ Eventbrite
(Search/Recommendations)
What is "tagging"?
… and why would you want it?
First, let's talk about e-commerce search
● Search is ubiquitous.
○ Search makes the internet accessible
○ Search is the backbone of many products
○ Search is embedded in most products
● E-commerce is powered by search
● Browse is an important aspect of
the experience. You filter inventory
based upon tags.
● Mobile users prefer browse over
text search.
● Everyone is moving to mobile.
These are tags!
These are
also tags!
What is "tagging"?
… and why would you want it?
It's the ability to CATEGORIZE and
UNDERSTAND your inventory.
Because it powers the emerging
dominant e-commerce interaction.
How can you tag your inventory?
● Use curators to tag content:
○ Benefits: control over tagging, uniform tagging approach
○ Drawbacks: curation approach must be defined, curators must be trained, curators are
expensive
● Require tagging from content creators:
○ Benefits: content creators know their content the best, scales well
○ Drawback: content creators may not cooperate if they see no advantage for themselves
● Encourage customers to tag content:
○ Benefits: customers are the ones buying content and their idea of tags matters most
○ Drawbacks: there's even less likelihood for customers to cooperate
… but what if nobody wants to tag your content?
An Interesting Observation
● Every day millions of people search for events on Eventbrite.
● They issue more than 500K distinct queries in a month.
● But the most common 1,000 queries account for 41% of all search traffic.
● The common queries look like tags!
Can we use logged searches as a
training set to build a tagging model?
○ 5k run
○ back to school
○ job fair
○ 4th of july party
○ baby
○ real estate
○ car show
○ pool party
○ golf
○ gospel
○ speed dating
○ boat party
○ photography
○ dog
○ data science
○ business
○ kids
○ networking
○ christian
○ free
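The observation above can be checked with a few lines of Python. This is a minimal sketch, not the notebook code from the talk: the `normalize` helper, the `top_query_share` function and the toy `search_log` are all hypothetical names and data chosen for illustration.

```python
from collections import Counter


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(query.lower().split())


def top_query_share(queries, n):
    """Return the n most common queries and the fraction of all searches they cover."""
    counts = Counter(normalize(q) for q in queries)
    total = sum(counts.values())
    top = counts.most_common(n)
    share = sum(count for _, count in top) / total if total else 0.0
    return [query for query, _ in top], share


# Toy log standing in for a month of searches (hypothetical data).
search_log = [
    "golf", "Golf", "job fair", "golf ", "5k run", "job fair", "jazz brunch near me",
]
top, share = top_query_share(search_log, n=2)
print(top, round(share, 2))
```

On a real log the same function, called with `n=1000`, would give the kind of head-versus-tail figure quoted on the slide.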
Search Logs
+ Machine Learning
= Automatic Tagging
John Berryman
Initial Approach
● Given – we have 3 tables:
○ search log
○ click log
○ event table
● Step 1: Find the most common 500 queries
● Step 2: Find all the events clicked after a user search using a common query
● Step 3: Collect the name and description of those events
● Create a training set:
○ X = input = title and body text of events
○ y = output = query string used to find them a.k.a. tags
● Train a model to predict y based on new X
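The steps above can be sketched in plain Python. The field names (`search_id`, `query`, `event_id`, `name`, `description`) and the sample rows are assumptions, since the talk does not give the real table schemas; the actual implementation lives in the notebook referenced below.

```python
from collections import Counter, defaultdict

# Hypothetical rows; the real column names are not given in the talk.
search_log = [
    {"search_id": 1, "query": "golf"},
    {"search_id": 2, "query": "job fair"},
    {"search_id": 3, "query": "golf"},
    {"search_id": 4, "query": "underwater basket weaving"},
]
click_log = [
    {"search_id": 1, "event_id": "e1"},
    {"search_id": 2, "event_id": "e2"},
    {"search_id": 3, "event_id": "e1"},
    {"search_id": 3, "event_id": "e3"},
    {"search_id": 4, "event_id": "e2"},
]
event_table = {
    "e1": {"name": "Charity Golf Classic", "description": "18 holes for a good cause."},
    "e2": {"name": "Spring Career Expo", "description": "Meet local employers."},
    "e3": {"name": "Mini Golf Night", "description": "Family putt-putt."},
}


def build_training_set(search_log, click_log, event_table, n_queries):
    # Step 1: the most common queries become the tag vocabulary.
    counts = Counter(row["query"] for row in search_log)
    vocabulary = {query for query, _ in counts.most_common(n_queries)}

    # Step 2: find events clicked after a search that used a common query.
    query_by_search = {row["search_id"]: row["query"] for row in search_log}
    tags_by_event = defaultdict(set)
    for click in click_log:
        query = query_by_search.get(click["search_id"])
        if query in vocabulary:
            tags_by_event[click["event_id"]].add(query)

    # Step 3: X is the event text, y is the list of queries that led to the event.
    X, y = [], []
    for event_id in sorted(tags_by_event):
        event = event_table[event_id]
        X.append(f'{event["name"]} {event["description"]}')
        y.append(sorted(tags_by_event[event_id]))
    return X, y


X, y = build_training_set(search_log, click_log, event_table, n_queries=2)
for text, tags in zip(X, y):
    print(tags, "<-", text)
```

Because one event can be reached from several queries, `y` is a list of tags per event, so the resulting pairs suit a multi-label text classifier.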
tagging_with_searches_1.ipynb
emergency backup plan
Problems with this Approach
● Near synonym tags:
○ memorial day
○ memorial day weekend events
○ memorial day weekend
● Small tag vocabulary
● Each event only gets 1 or 2 tags. Sometimes 0.
Improved Approach
● A session may contain several queries. These queries are often related:
○ Spelling corrections
○ Word synonyms
○ Query refinements or generalizations
● Idea:
○ Let's group query strings that co-occur in sessions with statistical significance.
○ Then we can train the neural network on the query string groups.
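One way to picture the grouping idea is the sketch below. It is an illustration under assumptions, not the method from the notebooks: the talk does not say which statistical test was used, so pointwise mutual information with a minimum co-occurrence count stands in for it here, and the function name, thresholds and sample sessions are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cluster_queries(sessions, min_pair_count=2, min_pmi=0.5):
    """Group queries that share sessions more often than chance would suggest."""
    n_sessions = len(sessions)
    query_count = Counter()
    pair_count = Counter()
    for session in sessions:
        unique = sorted(set(session))
        query_count.update(unique)
        pair_count.update(combinations(unique, 2))

    # Union-find over queries: linked queries end up in the same cluster.
    parent = {query: query for query in query_count}

    def find(query):
        while parent[query] != query:
            parent[query] = parent[parent[query]]  # path halving
            query = parent[query]
        return query

    for (a, b), together in pair_count.items():
        if together < min_pair_count:
            continue  # too rare to trust
        # PMI: log ratio of observed co-occurrence to what independence predicts.
        pmi = math.log2(together * n_sessions / (query_count[a] * query_count[b]))
        if pmi >= min_pmi:
            parent[find(a)] = find(b)

    clusters = {}
    for query in query_count:
        clusters.setdefault(find(query), set()).add(query)
    return sorted(sorted(cluster) for cluster in clusters.values())


# Hypothetical sessions, each a list of the queries one user issued.
sessions = [
    ["blockchai", "blockchain"], ["block chain", "blockchain"],
    ["bitcoin", "blockchain"], ["bitcoin", "blockchain"],
    ["blockchai", "blockchain"], ["block chain", "blockchain"],
    ["golf"], ["golf", "mini golf"], ["golf", "mini golf"],
    ["job fair"], ["job fair", "golf"], ["data science"],
]
clusters = cluster_queries(sessions)
print(clusters)
```

Linking through union-find is transitive, so a single strong pair can merge two otherwise unrelated groups; loosely related queries can end up in one cluster this way.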
query_string_clusters
and
tagging_with_searches_2
emergency backup plan
emergency backup plan
Things to Notice
Benefits
● Far fewer near-synonyms (bitcoin, block chain, blockchai → blockchain)
● More sample data
○ v1 model - 500 most popular queries - 33% of query traffic
○ v2 model - 2649 most popular queries collapsed down to 681 - 52% of query traffic
● Broader tags
Drawback
● Some of the clusters pull in very loosely related words
○ ai → blockchain
○ ozio → rosebar
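The collapsing step and the traffic figures can be illustrated as follows. This is a sketch under assumptions: picking the most frequent query in each cluster as the tag label is my choice for the example, and the counts and clusters are made up.

```python
from collections import Counter


def canonical_labels(clusters, query_counts):
    """Map every query in a cluster to that cluster's most frequent member."""
    mapping = {}
    for cluster in clusters:
        # sorted() makes tie-breaking deterministic (alphabetically first wins).
        label = max(sorted(cluster), key=lambda query: query_counts[query])
        for query in cluster:
            mapping[query] = label
    return mapping


def traffic_coverage(mapping, query_counts):
    """Fraction of all searches whose query maps to some tag."""
    total = sum(query_counts.values())
    covered = sum(count for query, count in query_counts.items() if query in mapping)
    return covered / total if total else 0.0


# Hypothetical monthly counts and clusters.
query_counts = Counter({
    "blockchain": 60, "bitcoin": 25, "block chain": 10, "blockchai": 5,
    "golf": 80, "mini golf": 20, "ozio": 50,
})
clusters = [["bitcoin", "block chain", "blockchai", "blockchain"], ["golf", "mini golf"]]

mapping = canonical_labels(clusters, query_counts)
coverage = traffic_coverage(mapping, query_counts)
print(sorted(set(mapping.values())), coverage)
```

Collapsing lets many query strings share a small tag vocabulary while their combined traffic still counts toward coverage.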
Tagging-Related Applications
● Power Faceted Search
● Infer relationships between tags
● Provide organizers tag
recommendations
● Better understand supply and
demand
● Apply tags to users for better
recommendations
● Search Synonyms (e.g.
misspellings)
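As an illustration of the first application, here is a minimal faceted-search sketch over tagged events. The event records and the helper names `filter_by_tags` and `facet_counts` are hypothetical; a production system would normally let the search engine compute facets.

```python
from collections import Counter

# Hypothetical inventory with model-assigned tags.
events = [
    {"id": "e1", "tags": {"golf", "networking"}},
    {"id": "e2", "tags": {"networking", "business"}},
    {"id": "e3", "tags": {"golf", "kids"}},
    {"id": "e4", "tags": {"business", "networking", "free"}},
]


def filter_by_tags(events, selected):
    """Keep only events that carry every selected tag."""
    return [event for event in events if selected <= event["tags"]]


def facet_counts(events, selected=frozenset()):
    """Count the remaining tags among events matching the current selection."""
    counts = Counter()
    for event in filter_by_tags(events, selected):
        counts.update(event["tags"] - selected)
    return counts


print(facet_counts(events))                  # facets for the full inventory
print(facet_counts(events, {"networking"}))  # facets after choosing "networking"
```

The counts after a selection also hint at relationships between tags: tags that keep appearing together, such as "business" and "networking" here, are candidates for related-tag suggestions.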
Future Work
● Better coverage
○ Currently reach 50% of our traffic with 2,500 queries.
Long tail is long! > 500K distinct queries in a month
○ Model is biased towards short-tail labels - everything's a "day party"
○ Can't cover searches for an event that isn't in our inventory.
● Create real pipeline
● Build out all the cool ideas on the last slide
Questions?
… better yet, Ideas?
Final notes:
● My jupyter notebooks are here:
○ First implementation
○ Query collapsing
○ Second implementation
○ Third implementation
Data Nerds
● Want to learn data science with
others? You should try Data Nerds.
● Do you like spending time around
people that love learning? Penny
University is the peer-to-peer
learning community for you!
● I just shared my talk
https://twitter.com/JnBrymn
This slide intentionally left blank.
DON'T FORGET
● Tweet the slides out just before the talk
● Open the notebooks
○ do
■ cd ~/Personal/data_science/tagging_events/
■ jupyter notebook
■ open the 3 notebooks in event_tagging_strategies
○ or just use gists: one, two, three
● Bump up the font size on the notebooks
● Remove the menus
● Clear cells


Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - John Berryman


Editor's Notes

  • #2: I hope so (that we use logged searches as a training set to build a tagging model) because the title of the talk is ^
  • #7: "...but" - this is the situation that Eventbrite was in
  • #9: I hope so (that we use logged searches as a training set to build a tagging model) because the title of the talk is ^