SOLR 101
JavaZone 2012, Oslo, Sébastien Muller, Findwise
Agenda
 Introductions
 Enterprise search
 What is Solr, why choose it?
 Solr Terminology
 Main Solr Features
 How it works
 Anatomy of a Query
 Scalability
 Case studies
            SpareBank1
            Komplett Group
Enterprise Search
  Search has become mission critical for most enterprises
            Intranet
            Web presence
            E-commerce


  Exponential growth of data


  Cost of not finding information
            Knowledge (sharing)
            Time
            Money


  Information blackhole
What is Solr?

 Official definition:


 
       “Solr is an open source enterprise search platform based on the
       Lucene Java search library, with an HTTP interface using XML,
       JSON or other formats. It provides hit highlighting, faceted
       search, caching, replication, a web administration interface and
       many more features. It runs in a Java servlet container such as
       Apache Tomcat.”



              http://lucene.apache.org/solr
What is Solr?
  Open-source search engine with no license fees (Apache License 2.0)
  Uses Apache Lucene library and adds enterprise search server
   features and capabilities
  Web based application that processes requests and returns
   responses via HTTP
  Easy scalability and great performance
  Industry-tested worldwide
  Modern solution architecture based on XML and Java – easy to work
   with
  Well integrated with the ecosystem around Big Data, such as
   Hadoop (also Nutch, Tika).
Why choose Solr?
  “Buy” > Build


  Open source vs. Commercial solution
            Open source software is free
            Licensed software can be very expensive


  High quality and easily modifiable relevancy

  Very fast query and indexing performance

  Highly flexible data processing/transformation
Why choose Solr?
  Some challenges unique to open source…
            No guaranteed support or bug fixing from community
            No formal quality control or support for upgrades
            Limited support for less experienced developers



  Some benefits unique to open source…
            Widely used and tested
            Access to source code
            Access to development versions and unreleased patches



  Ultimately search is a specialised field and requires specialists
Solr Terminology
  Index(ing)
               Inverted index
  Document
  Field
               Stored and/or indexed fields
  Analysis
               Tokenization
               Filters
               Terms
  Query
               Filter
               Function
               Facet
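The terms above fit together roughly as follows: analysis turns a field's text into terms, and the inverted index maps each term back to the documents that contain it. The sketch below is a toy illustration in Python, not how Lucene stores its index; the sample documents and the function names `analyze` and `build_inverted_index` are made up for this example.

```python
from collections import defaultdict


def analyze(text):
    """Toy analysis chain: whitespace tokenization plus a lowercase filter."""
    return [token.lower() for token in text.split()]


def build_inverted_index(docs, field):
    """Map each term to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc in docs:
        for term in analyze(doc[field]):
            index[term].add(doc["id"])
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical documents, each a collection of fields.
docs = [
    {"id": 1, "title": "Apache Solr in Action"},
    {"id": 2, "title": "Lucene in Action"},
    {"id": 3, "title": "Enterprise Search with Solr"},
]

index = build_inverted_index(docs, "title")
print(index["solr"])    # [1, 3]
print(index["action"])  # [1, 2]
```

Looking up a term is a single dictionary access, which is why queries against an inverted index are fast compared with scanning every document.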
Main Solr Features
  Full text search
  Field search
  Number and date searching
  Facets
  Spelling assistance – “Did you mean…?”
  Replication
             Master/Slave architecture

  Related hits
  Query completion
  Admin GUI
How it works
  Easy configuration through XML
               schema.xml
               solrconfig.xml


  Documents are POSTed via HTTP to Solr
               Add/update
               Delete
               Commit


  Queries and responses are also sent via HTTP
               Choice of formats
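As a rough illustration of the add/delete/commit cycle, the sketch below builds Solr's XML update messages with the Python standard library and wraps them in POST requests without sending them. The update URL (host, port and path) and the field names are assumptions for this example; adjust them to your own installation and schema.xml.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed location of the update handler; not taken from the slides.
UPDATE_URL = "http://localhost:8983/solr/update"


def add_message(doc):
    """Build an <add> message holding one document (values are XML-escaped)."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        ET.SubElement(doc_el, "field", name=name).text = str(value)
    return ET.tostring(add, encoding="unicode")


def delete_message(doc_id):
    """Build a <delete> message that removes one document by its unique key."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")


def commit_message():
    """Build the <commit/> message that makes pending changes searchable."""
    return ET.tostring(ET.Element("commit"), encoding="unicode")


def build_request(xml_body):
    """Wrap a message in an HTTP POST request; nothing is sent here."""
    return urllib.request.Request(
        UPDATE_URL,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


print(add_message({"id": "doc1", "title": "Solr 101"}))
print(delete_message("doc1"))
request = build_request(commit_message())
print(request.get_method(), request.full_url)
```

Passing a request to `urllib.request.urlopen` would actually send it; that step is left out so the sketch runs without a Solr server.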
Anatomy of a Query
  Common parameters
          Start, rows, fl, fq, sort
          http://wiki.apache.org/solr/CommonQueryParameters




 ?q=*:*&start=0&rows=10&fl=title&fq=collection:popular&sort=title asc

   Slightly more advanced
          &facet=true
          &qf (query fields, for the DisMax query parser)
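The query string above can be assembled programmatically so that special characters and spaces are escaped correctly. This is a minimal sketch using only the Python standard library; the select URL and the helper name `build_query_url` are assumptions, while the parameter names (q, start, rows, fl, fq, sort) are the common parameters listed on the slide.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed location of the select handler; not taken from the slides.
SELECT_URL = "http://localhost:8983/solr/select"


def build_query_url(q, start=0, rows=10, fl=None, fq=None, sort=None):
    """Assemble a select URL from the common query parameters."""
    params = [("q", q), ("start", start), ("rows", rows)]
    if fl:
        params.append(("fl", fl))
    for filter_query in fq or []:
        # fq may be repeated; each filter query narrows the result set.
        params.append(("fq", filter_query))
    if sort:
        params.append(("sort", sort))
    return SELECT_URL + "?" + urlencode(params)


url = build_query_url(
    "*:*", fl="title", fq=["collection:popular"], sort="title asc"
)
print(url)

# Decoding the query string recovers the original parameter values.
print(parse_qs(urlsplit(url).query))
```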
What is Faceting?
  Navigation/discovery technique


  Tally of docs for each distinct field value


  Parameters
             &facet=true
             &facet.field=category


                                                And so much more…
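The "tally of docs for each distinct field value" can be pictured with a few lines of Python. This is only a conceptual sketch of what a request with &facet=true&facet.field=category counts; Solr computes it from the index rather than by looping over documents, and the sample data and the name `facet_counts` are invented for this example.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count matching documents per distinct value of one field,
    most frequent value first."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    return counts.most_common()


# Hypothetical result set for some query.
matching_docs = [
    {"id": 1, "category": "laptop"},
    {"id": 2, "category": "laptop"},
    {"id": 3, "category": "phone"},
    {"id": 4, "category": "tablet"},
    {"id": 5, "category": "laptop"},
    {"id": 6, "category": "phone"},
]

print(facet_counts(matching_docs, "category"))
# [('laptop', 3), ('phone', 2), ('tablet', 1)]
```

A search UI typically renders each pair as a clickable link that adds a matching filter query, which is what makes facets a navigation technique.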
Scalability
  Architecture goals:
            More queries per second (qps)
            Faster query execution
            Bigger indexes
            Faster indexing


  Scaling options
            Multicore
            Replication
            Sharding
Scalability - Multicore
  Running more than one Solr core (each with its own configuration and index) in a single Solr webapp


          <solr persistent="true" sharedLib="lib">
            <cores adminPath="/admin/cores">
              <core name="core0" instanceDir="core0" />
              <core name="core1" instanceDir="core1" />
            </cores>
          </solr>


  http://localhost:8080/solr/admin/cores?action=...
           STATUS
           CREATE
           SWAP
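The core admin actions listed above are plain HTTP requests against the admin path. The sketch below builds the URLs for STATUS, CREATE and SWAP in Python, reusing the base URL from the slide; the core names (core0, core1, core2) and the helper `core_admin_url` are illustrative assumptions, and no request is sent.

```python
from urllib.parse import urlencode

# Base URL as shown on the slide.
CORE_ADMIN_URL = "http://localhost:8080/solr/admin/cores"


def core_admin_url(action, **params):
    """Build a core admin URL for the given action and its parameters."""
    query = urlencode({"action": action, **params})
    return f"{CORE_ADMIN_URL}?{query}"


# Report the status of one core.
status_url = core_admin_url("STATUS", core="core0")
# Register a new core from an existing instance directory.
create_url = core_admin_url("CREATE", name="core2", instanceDir="core2")
# Atomically swap the names of two existing cores.
swap_url = core_admin_url("SWAP", core="core0", other="core1")

for url in (status_url, create_url, swap_url):
    print(url)
```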
Scalability - Replication
  Basic architecture – indexing/querying handled by one instance


  1:1 Master/slave
           Indexing
           Querying


  1:N Master/slaves
           Different user groups
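On the client side, a master/slave setup means sending updates to the master and spreading queries over the slaves. The following is a minimal sketch of that routing in Python, assuming a simple round-robin policy; the host names and the class `ReplicatedCluster` are hypothetical, and index replication itself is configured in Solr, not in client code.

```python
import itertools


class ReplicatedCluster:
    """Route updates to the master and spread queries across the slaves."""

    def __init__(self, master, slaves):
        self.master = master
        # Round-robin over the slaves; one possible policy among many.
        self._next_slave = itertools.cycle(slaves)

    def update_url(self):
        """All indexing traffic goes to the single master."""
        return f"{self.master}/update"

    def select_url(self):
        """Each query goes to the next slave in turn."""
        return f"{next(self._next_slave)}/select"


cluster = ReplicatedCluster(
    master="http://master:8983/solr",
    slaves=["http://slave1:8983/solr", "http://slave2:8983/solr"],
)
print(cluster.update_url())
print([cluster.select_url() for _ in range(3)])
```

Adding slaves raises the number of queries the cluster can answer per second, but each query is still served by one machine holding the whole index.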
Scalability - Sharding
  Distributed index
            N masters with index split between them
            Simple hashing to choose index


  Sharding + replication
             N masters with M slaves each
             More shards = faster execution time
             More slaves = higher average QPS




      &shards=solr1:8983/solr,solr2:8983/solr&indent=true&q=ipod+solr
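"Simple hashing to choose index" can be sketched as follows: hash the document's unique key and take it modulo the number of shards when indexing, then list every shard in the shards parameter when querying. This is an illustrative Python sketch; the CRC32 hash and the function names are assumptions, and the shard addresses are the ones from the slide.

```python
import zlib
from urllib.parse import urlencode

# Shard addresses as shown on the slide.
SHARDS = ["solr1:8983/solr", "solr2:8983/solr"]


def shard_for(doc_id, shards=SHARDS):
    """Pick the shard that should index a document, from a hash of its key.

    CRC32 is an arbitrary choice here; any stable hash would do.
    """
    return shards[zlib.crc32(doc_id.encode("utf-8")) % len(shards)]


def distributed_query(q, shards=SHARDS):
    """Build the query string for a search fanned out over all shards."""
    params = {"shards": ",".join(shards), "indent": "true", "q": q}
    # Keep ':', '/' and ',' readable instead of percent-encoding them.
    return urlencode(params, safe=":/,")


print(shard_for("doc-42"))
print(distributed_query("ipod solr"))
# shards=solr1:8983/solr,solr2:8983/solr&indent=true&q=ipod+solr
```

Because the hash is stable, the same document always lands on the same shard, which keeps its unique key unique across the whole distributed index.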
Case Studies
SpareBank1 - Background
  SpareBank1 Gruppen
            19 individual localised bank portals and one parent front page


  Boost 25 umbrella project
            Semantic URLs: https://www2.sparebank1.no/9898/3_privat?_nfpb=true&_nfls=false&_pageLabel=page_privat_innhold&pId=1233149354625&_
            New search interface
            Banking app


   CMS with no easy way of tracking individual banks’
    publications
            Mass duplicates
            Access to irrelevant data
SpareBank1 - Requirements
  Customer requirements: “bedre portal søk” (“better portal search”)
SpareBank1 - Requirements
  Basic search features include


            High quality relevance and precision


            Relevant faceting


            Query completion


            Spell check and suggestions


            Search analytics
SpareBank1 – Live Demo




    https://www2.sparebank1.no/
Komplett - Background
  Komplett NO, SE, DK… inWarehouse.se, MPX


  Existing Solr solution
             Mile long query with boosting per field


  Poor relevance
             Peripherals/accessories ranked higher than products


   Limited faceting

   No query completion or spellcheck

   Sloooooow indexing
Komplett - Requirements
  Superior and customisable relevance model


  Much more comprehensive indexing of products and specifications


  Spellcheck


  Query completion


  So much more faceting
Sébastien Muller
sebastien.muller@findwise.com

Editor's Notes

  • #3: Intros: me & Solr
  • #4: Who – Swiss, Sébastien Muller, ex Solr newbie, 1 yr w/ Solr almost daily, several projects
        What – work for Findwise -> Findwizard, information access consultant... Enterprise search!
        Where – Oslo for a year
        Why – Oslo Solr Meetup community -> semi-regular meetings at a pub in Oslo

        INTRODUCTORY TALK
  • #5: Internally and/or externally, both for finding information or finding who has the information... E-commerce fails w/out search: can't find what you want to buy = no sale
        A lot of which is unstructured
        According to research performed by Google, approx. 85% of organisations can barely access less than 50% of the data they produce
  • #7: General description – the “sales pitch”

        Web based app – runs in a servlet container and is deployed as a Java WAR, works with all major application servers such as Tomcat or Jetty
  • #8: No point in building when there are open source and licensed options readily available
        Open source = free but might end up spending a lot to get it to do what you want – no vendor lock-in, complete customisability
        Licensed = expensive and likely to spend a lot still
  • #9: Although being open source allows for a very low barrier for adoption, there is no Service Level Agreement with an open source community
        Open source community based project likely to yield better long term QA testing, more personally invested in the quality of the project, but life > *

        Issue tracking is public
        Access to source code – but the documentation isn't necessarily comprehensive... google :D
  • #10: Inverted index – like the index of a book: a list of keywords paired with locations -> makes for v. fast queries rather than searching through documents for specific terms

        Document = collection of fields with optional boosting values, e.g. a book, a page, a database entry etc. -> represented by a single result or hit

        Field = single or multi valued

        Stored = original pre-analysis value stored and returnable by queries, necessary for some features, increases index size -> store and retrieve, unless indexed too

        Indexed = searchable and facetable, unless stored will not be returned in a search

        Tokenization = breaking up text sequences, filtering and transforming to generate “terms” that are tied to a specific field

        Filter = restricts the search space by creating a subset of indexed documents against which queries can be made

        Function = contributes to relevance calculations at query time, can be customised
  • #12: Schema = define data and field types

        Solrconfig = search components, replication, request handlers etc.

        DEMO schema/solrconfig in Notepad++ and DEMO POSTing from NetBeans/browser
  • #13: Sort on relevance score, value of fields...

        If there is a tie, docs are sorted by date added (indexed time)

        More shiny examples to follow
  • #16: Single Solr instance with separate configurations (schema and config) and indexes while maintaining the convenience of unified administration

        Easy to add new cores or even replace cores with each other

        STATUS, CREATE, SWAP, UNLOAD, ALIAS, RENAME

        SWAP atomically swaps the names used to access two existing cores. This can be useful for replacing a "live" core with an "ondeck" core, and keeping the old "live" core running in case you decide to roll back.
  • #17: Basic – good starting point, fine for a small index with few updates and low query load, as all updates will slow down querying

        Master/slave – indexing on one, querying on the other, all replications slow down indexing, can modify replication interval – improves query speed/qps

        1:N – more qps, no more query speed necessarily -> both require the index to remain on 1 machine
  • #18: Updates go to one of N machines – unique key field must be unique across all shards – a couple of features aren't supported, e.g. More Like This and joins

        Makes it easier to rebalance index operations across more servers

        shards=solr1:8983/solr,solr2:8983/solr&indent=true&q=ipod+solr -> shards parameter syntax -> can be added to a requestHandler specifically for shards
  • #19: Norwegian bank and Norwegian-based e-commerce group of sites
  • #20: One bank portal of about 1.5k docs, c. 50% were duplicates

        Group publications made globally available via CMS, but individual banks are under no obligation to publish articles and there's no indication as to whether they had or not
  • #23: Semantic FAIL