Solr At AOL, Presented by Sean Timm at SolrExchange DC
Solr and Lucene @ AOL
SEAN TIMM, CHIEF ARCHITECT, AOL ADVERTISING
1999
• Believe, Cher and Livin’ la Vida Loca, Ricky Martin
• The Matrix and The Phantom Menace
• Windows 98 Second Edition
• AltaVista, Northern Light, Yahoo, ODP, Inktomi
– Google
• PPC Text search ads invented 1998
– Banner ads
A Brief History of Search @ AOL
• Acquired PLS in 1998
• AOL Search used ODP
• Site Search
• Local Search
• Built into AOL Server
• CPL
– VSM then BM25
– Phrase, numeric, date, text, and
proximity boosting
– Conflation classes (like synonyms)
Relevance
• Precision/recall
• “free alcohol” vs. “alcohol free”
• Lawyer versus Attorney
• Iron and ironic → same stem (Porter)
• Beyonce vs. Beyoncé
• Eagles
–Bird, sports teams, band, AMC Eagle
• F 15, F-15, F15
• FREAK
[Figure: relevant vs. retrieved result sets]
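The precision/recall bullet can be made concrete with a small calculation. This is a minimal sketch, not anything shown in the talk: the function name `precision_recall` and the document ids are hypothetical, chosen only to illustrate the two ratios.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # documents that are both retrieved and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids for one query.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

Precision is the share of retrieved documents that are relevant; recall is the share of relevant documents that were retrieved. Most of the examples on this slide trade one against the other.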
The Dawn of Solr
• Prohibitively expensive to continue CPL development
• Complicated deployment
• 2005: Investigating migration to Lucene
• 2006: CNET open sourced Solr
Contributions
• Local Lucene/Solr (superseded by SpatialSearch)
• Query Timeout
• Data Import Handler (DIH)
• Numerous smaller patches
• Committers: Noble Paul, Shalin Mangar, Patrick
O’Leary
Contributing to Solr/Lucene
• Learn
–Join the mailing lists
•solr-user@lucene.apache.org
•dev@lucene.apache.org
–Read search and Solr related blogs
–The #solr IRC channel on freenode
Contributing to Solr/Lucene
• Help others
–Answer questions.
–Improve documentation in the code, the wiki, or
the website.
–Make improvements to the Solr Admin UI.
Contributing to Solr/Lucene
• Confirm a bug
• Submit a patch for a reported bug or feature
request
• Improve a patch
• Try out a patch and see if it works
Contributing to Solr/Lucene
• Submit your own tickets
– Bug
– Feature request
• Start with solr-user@lucene
• Discuss on dev@lucene
• Create Jira ticket, ideally with patches and unit tests
• Yonik’s Law of Patches:
– A half-baked patch in Jira, with no documentation, no tests, and no
backwards compatibility is better than no patch at all.
Applications
• MapQuest (SpatialSearch)
• Mail
• AIM
• AOL Search
• Site Search
• News Search
• RUM
• Sarah Palin e-mails (admin)
• Demand
• Wikipedia article pattern detection
MapQuest Discover
Travel Blogs
MQ Local Search
Related Searches
Bipartite graph snippet
Related Searches Graph
[Figure: graph nodes around “The Eagles”: The band, NFL, Boston College, Hotel California, Tribute]
Related Searches
• Simple query
– User
• New York Library
– Solr query
• Lower case
• Prefer exact match “new york library”
• Use phrase slop to allow terms in same order and near each
other, e.g., new york city public library
• primeQuery:"new york library" OR "new york library"~3
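The query-rewriting steps on this slide could be sketched as a small helper. This is an illustration under assumptions: the function name `build_related_query` and its defaults are hypothetical, and only the field name `primeQuery` and the slop value 3 come from the slide. The output mirrors the slide's form, in which the second clause carries no field prefix and so would run against the default field.

```python
def build_related_query(user_query, field="primeQuery", slop=3):
    """Lower-case the input, then OR an exact phrase with a sloppy phrase."""
    # Collapse whitespace and drop double quotes so the phrase stays well-formed.
    phrase = " ".join(user_query.lower().replace('"', " ").split())
    return f'{field}:"{phrase}" OR "{phrase}"~{slop}'


print(build_related_query("New York Library"))
# primeQuery:"new york library" OR "new york library"~3
```

A document containing the exact phrase satisfies both clauses, which is one way the exact match ends up preferred over a match that only satisfies the sloppy clause.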
Wikipedia Traffic Correlation Schema
<field name="title" type="string" indexed="true" stored="true" required="true" />
<field name="title_norm" type="string" indexed="true" stored="true" required="true" />
<field name="total_pvs" type="long" indexed="true" stored="true" required="true" />
<!-- Dynamic field definitions. If a field name is not found, dynamicFields
will be used if the name matches any of the patterns.
RESTRICTION: the glob-like pattern in the name attribute must have
a "*" only at the start or the end.
EXAMPLE: name="*_i" will match any field ending in _i (like myid_i, z_i)
Longer patterns will be matched first. if equal size patterns
both match, the first appearing in the schema will be used. -->
<!-- trend direction. field name contains date string, e.g., "trend_20110622" -->
<dynamicField name="trend_*" type="int" indexed="true" stored="true"/>
<!-- page views. field name contains date string, e.g., "pvs_20110622" -->
<dynamicField name="pvs_*" type="long" indexed="true" stored="true"/>
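To show how the dated dynamic fields might be populated, here is a minimal sketch that builds one document as a plain dictionary. The field names follow the schema above; everything else is an assumption made for illustration: the helper name `traffic_doc`, the lower-casing used for `title_norm`, the encoding of trend direction as 1, 0, or -1, and the page-view numbers.

```python
from datetime import date


def traffic_doc(title, daily_views):
    """Build a document dict; daily_views maps a date to that day's page views."""
    doc = {
        "title": title,
        "title_norm": title.lower(),  # assumed normalization: lower-casing only
        "total_pvs": sum(daily_views.values()),
    }
    previous = None
    for day in sorted(daily_views):
        stamp = day.strftime("%Y%m%d")  # e.g. "20110622"
        views = daily_views[day]
        doc[f"pvs_{stamp}"] = views  # matches the pvs_* dynamic field
        if previous is not None:
            # Assumed encoding: 1 = up, -1 = down, 0 = flat versus the prior day.
            doc[f"trend_{stamp}"] = (views > previous) - (views < previous)
        previous = views
    return doc


# Hypothetical page-view counts.
doc = traffic_doc("Sarah Palin", {date(2011, 6, 21): 1200, date(2011, 6, 22): 1500})
print(doc["total_pvs"], doc["pvs_20110622"], doc["trend_20110622"])  # 2700 1500 1
```

Because the date is part of the field name, a new day adds new fields to each document without any schema change.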
Temporal Traffic Correlation of Wikipedia Page
Views
Sarah Palin E-mail Stats
• 13,177 documents
• 4 hours from receiving data to production install
• ~150 K requests per day at launch
• Now about 6-7 K requests per day
• Running on 3 VMs in two different data centers
behind a NetScaler
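For scale, the daily request counts can be converted to average requests per second. This is back-of-the-envelope arithmetic only: 6,500 is an assumed midpoint of the "6-7 K" figure, and averages say nothing about peak load.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def avg_rps(requests_per_day):
    """Average requests per second over a full day."""
    return requests_per_day / SECONDS_PER_DAY


print(f"{avg_rps(150_000):.2f}")  # 1.74  (at launch)
print(f"{avg_rps(6_500):.3f}")    # 0.075 (assumed midpoint of 6-7 K)
```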
Faceting and Clustering
Huffington Post Comments
• Solr 4
• Uses Solr Cloud
• Single shard
• ReplicationFactor 3
• Real-time
• 90 days of comments
• Tested up to 100 writes / second
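A single-shard collection with three replicas, as described here, could be requested through the Solr Collections API. The sketch below only builds the request URL. The host, port, and collection name `comments` are hypothetical, and a real Solr 4 deployment may need further parameters (for example a config set name) that the slides do not mention.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards=1, replication_factor=3):
    """Build a Collections API CREATE URL for a SolrCloud collection."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Hypothetical host and collection name.
print(create_collection_url("http://localhost:8983/solr", "comments"))
# http://localhost:8983/solr/admin/collections?action=CREATE&name=comments&numShards=1&replicationFactor=3
```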
More HuffPost comments
• Used by editors and moderators
–Topic investigation
–Troll detection
• Config
–Special features: search for emoticons, prefer
exact match, date boosting
• Hack-a-thon comment clustering, timeline, and
summarization
Solr Comments Architecture
[Diagram components: Message Queue, Mongo Ingestor, MongoDB, Solr Ingestor (uses SolrJ CloudSolrServer), Solr Cloud, Tools Server, JuLiA]
Relevance in Solr
• “free alcohol” vs. “alcohol free”
–Phrase queries and phrase slop
• Lawyer versus Attorney
–SynonymFilterFactory
• Iron and ironic
–Kstem, or Lemmatization via the
SynonymFilterFactory instead of
Snowball/Porter
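The first two fixes can be illustrated outside Solr with a few lines of Python. This is a simplified sketch, not how Lucene implements either feature: `phrase_match` only accepts terms in their original order (Lucene's slop also permits reordering at higher slop values), and the synonym table is a hypothetical one-way mapping.

```python
def phrase_match(doc_tokens, phrase_tokens, slop=0):
    """True if the phrase terms occur in order with at most `slop` extra positions."""
    n = len(phrase_tokens)
    if n == 0:
        return False
    for start, token in enumerate(doc_tokens):
        if token != phrase_tokens[0]:
            continue
        pos, matched = start, 1
        while matched < n:
            try:
                # Earliest later occurrence of the next phrase term.
                pos = doc_tokens.index(phrase_tokens[matched], pos + 1)
            except ValueError:
                break
            matched += 1
        if matched == n and (pos - start + 1) - n <= slop:
            return True
    return False


SYNONYMS = {"lawyer": "attorney"}  # hypothetical mapping


def apply_synonyms(tokens):
    """Map each token to its canonical synonym, if one is defined."""
    return [SYNONYMS.get(t, t) for t in tokens]


doc = "alcohol free beer".split()
print(phrase_match(doc, ["alcohol", "free"]))  # True
print(phrase_match(doc, ["free", "alcohol"]))  # False

city = "new york city public library".split()
print(phrase_match(city, ["new", "york", "library"], slop=0))  # False
print(phrase_match(city, ["new", "york", "library"], slop=3))  # True

print(apply_synonyms(["lawyer"]) == apply_synonyms(["attorney"]))  # True
```

Both documents in the first pair contain the same words, so only a position-aware phrase check can tell "free alcohol" from "alcohol free".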
Relevance in Solr
• Beyonce vs. Beyoncé
–Various Folding Filters
• Eagles
–Boost on other fields, such as
popularity, publish date
–Use related searches, facets, or clustering
• F 15, F-15, F15
–WordDelimiterFilter
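The folding and word-delimiter ideas can be sketched with the Python standard library. These helpers are illustrative assumptions, not Solr's filters: `fold_accents` strips combining marks after Unicode decomposition, and `catenate_parts` simply drops every non-alphanumeric character, which is cruder than what WordDelimiterFilter does (in Solr, "F 15" would also already be split into two tokens by the tokenizer).

```python
import re
import unicodedata


def fold_accents(text):
    """Strip diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def catenate_parts(term):
    """Collapse variants such as 'F 15', 'F-15', and 'F15' to a single key."""
    return re.sub(r"[^0-9a-z]+", "", term.lower())


print(fold_accents("Beyoncé"))  # Beyonce
print({catenate_parts(t) for t in ["F 15", "F-15", "F15"]})  # {'f15'}
```

Applying the same normalization at index time and at query time is what lets all the spelling variants land on the same term.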
Bringing a New Search Project Online
• Understand the domain
• Ingest (sample) data
• Clean data
• Repeat
• Relevance testing
• Scale out
• Launch/Success
