luceneSolr = new LuceneSolr(4.x)

Grant Ingersoll
CTO, LucidWorks




Confidential © Copyright 2012
Search is dead, long live search


    • Embrace fuzziness!

    • Search is a system building block

    • If the algorithms fit, use them!

    • Search use leads to search abuse
        - Denormalization frees your mind
        - Scoring is just a sparse matrix multiply

    • Scoring features are everywhere

    http://cheezburger.com/5243950080
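
The "sparse matrix multiply" point can be made concrete with a toy sketch. It assumes a term-by-document weight matrix stored as nested dicts; the index contents, weights, and the function name `score` are invented for illustration and are not Lucene internals.

```python
from collections import defaultdict

# Toy inverted index: term -> {doc_id: weight}. This is a sparse
# term-by-document matrix; the weights are made up for illustration.
index = {
    "lucene": {1: 0.5, 2: 1.0},
    "solr":   {2: 0.5, 3: 2.0},
    "search": {1: 1.0, 3: 0.5},
}

def score(query_weights, index):
    """Multiply a sparse query vector by the sparse term-document matrix."""
    scores = defaultdict(float)
    for term, q_weight in query_weights.items():
        # Only the rows (posting lists) for the query terms are touched.
        for doc_id, d_weight in index.get(term, {}).items():
            scores[doc_id] += q_weight * d_weight
    return dict(scores)

result = score({"lucene": 1.0, "solr": 2.0}, index)
```

Documents that share no term with the query never appear in `result`, which is why the multiply stays cheap on large, sparse indexes.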
    Confidential and Proprietary
2   © 2012 LucidWorks
Search (R)evolution


• “T’ain’t your father’s search engine”
    - Non-free-text uses abound

• NoSQL before NoSQL was cool
    - Many DB-like features


• Flexibility during indexing and scoring

• Finite State Transducers FTW!

• Scale


Agenda

• What’s new in Lucene 4?

• What’s new in Solr 4?

• Sneak Peek: what’s ahead?




Lucene 4




Up and to the Right




    • http://people.apache.org/~mikemccand/lucenebench/indexing.html

Lucene: Flexibility

• Flexible Index Formats
    - New posting list codecs: Block, Simple Text, Append (HDFS, ...),
      etc.
    - Pulsing codec: speeds up primary key searches by inlining docs,
      positions, and payloads, which saves disk seeks


• Pluggable Scoring
    - Decoupled from TF/IDF
    - Built-in alternatives include BM25 & DFR (divergence from randomness)
          » http://en.wikipedia.org/wiki/Okapi_BM25
          » http://terrier.org/docs/v3.5/dfr_description.html
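
To show what a non-TF/IDF scorer looks like, here is a minimal sketch of the textbook Okapi BM25 per-term score. It assumes the common defaults k1=1.2 and b=0.75 and an IDF variant with +1 inside the logarithm; the function name and parameters are illustrative, and this is not Lucene's own similarity code.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq,
                    k1=1.2, b=0.75):
    """Score one query term against one document with textbook BM25."""
    # Rare terms get a higher inverse document frequency.
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: long documents need more occurrences to score well.
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm

# Term frequency saturates: the tenth occurrence adds less than the second.
once = bm25_term_score(1, 100, 100, 1000, 10)
twice = bm25_term_score(2, 100, 100, 1000, 10)
nine = bm25_term_score(9, 100, 100, 1000, 10)
ten = bm25_term_score(10, 100, 100, 1000, 10)
```

The saturation of term frequency and the explicit length normalization are the main differences a user would notice compared with classic TF/IDF scoring.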




Lucene: Speed and Memory

• Native Near Real Time (NRT) support
    - Per segment
    - FieldCache can be controlled to only load new segments
• Soft commit
    - Faster without fsync, allows quicker update visibility
• DWPT (DocumentsWriterPerThread)
    - Faster, more consistent indexing speed
• Faster fuzzy & wildcard query processing
    - Higher performance searching
• String -> BytesRef
    - Much improved data structure
    - … means less memory and less garbage collection effort

BytesRef memory management improvements


    • On a Wikipedia index (11M documents)
        - Time to perform the first query with sorting (no warmup queries)
          Solr 3x: 13 seconds, Solr 4: 6 seconds.

        - Memory consumption Solr 3x: 1,040M, Solr 4: 366M.

        - Number of objects on the heap. Solr 3x: 19.4M, Solr 4: 80K. No,
          that’s not a typo.

        - http://searchhub.org/2012/04/06/memory-comparisons-between-solr-3x-and-trunk/




FuzzyQuery

     • http://people.apache.org/~mikemccand/lucenebench/Fuzzy2.html




QPS (primary key lookup)

     • http://people.apache.org/~mikemccand/lucenebench/PKLookup.html




Lucene: Features

• Doc Values
    - Store data in column order
    - Tradeoffs when using vs. FieldCache
    - http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/
    - http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene

• DirectSpellChecker
    - No more sidecar index!

• Geospatial improvements (more later)
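
"Column order" can be pictured with a small sketch. It assumes three made-up documents and a hypothetical helper `sort_by`; it only illustrates the layout idea, not how Lucene encodes doc values on disk.

```python
# Row-oriented view: each document keeps all of its fields together.
docs = [
    {"id": "a", "price": 30, "year": 2011},
    {"id": "b", "price": 10, "year": 2013},
    {"id": "c", "price": 20, "year": 2012},
]

# Column-oriented view: one dense array per field, indexed by
# document number.
columns = {field: [doc[field] for doc in docs] for field in ("price", "year")}

def sort_by(field):
    """Return document numbers ordered by one field, reading only its column."""
    column = columns[field]
    return sorted(range(len(column)), key=column.__getitem__)

order = sort_by("price")
```

Sorting or faceting on `price` touches a single array here, instead of loading every document to pull out one field.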



Solr 4




Solr 4: Features

• Search/Faceting/Relevance
    - New Relevance Function Queries (tf, df, others)
    - Pivot Faceting
    - Pseudo-join
    - DirectSpellChecker support
    - Improved Spatial (more later)
• Indexing
    - New Update Processors, including scripting option
    - NRT
• Other
    - DocTransformer pluggability
    - New Admin UI


Geospatial improvements

• Multiple values per field
• Index shapes other than points (circles, polygons, etc.)
• More complex interactions than point in a circle

• Indexing:
    - "geo":"43.17614,-90.57341"
    - "geo":"Circle(4.56,1.23 d=0.0710)"
    - "geo":"POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"
• Searching:
    - fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"
    - fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))"
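
A short sketch of building one of the search requests above as a URL. The host, port, and collection name are assumptions for illustration, as is the helper name `spatial_filter_url`; only the `fq=geo:"Intersects(...)"` syntax comes from the slide.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical endpoint: host and collection name are assumptions.
BASE_URL = "http://localhost:8983/solr/collection1/select"

def spatial_filter_url(field, shape, query="*:*"):
    """Build a select URL that filters on shapes intersecting `shape`."""
    fq = '{0}:"Intersects({1})"'.format(field, shape)
    # urlencode escapes the quotes, spaces, and parentheses for us.
    return BASE_URL + "?" + urlencode({"q": query, "fq": fq})

url = spatial_filter_url("geo", "-74.093 41.042 -69.347 44.558")

# Round-trip the query string to confirm the filter survives encoding.
params = parse_qs(url.split("?", 1)[1])
```

Letting a URL library do the escaping avoids the quoting mistakes that are easy to make when the filter contains spaces and nested parentheses.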

/solr




SolrCloud

• Distributed/sharded indexing & search
    - Auto distributes updates and queries to appropriate shards
    - Near Real Time (NRT) indexing capable
• Dynamically scalable
    - New SolrCloud instances add indexing and query capacity
• Reliable
    - No single point of failure
    - Transactions logged
    - Robust, automatic recovery
• http://wiki.apache.org/solr/SolrCloud



SolrCloud’s capabilities

• Transaction log
    - All updates are added to the transaction log (tlog). The tlog supports:
        - durability for updates that have not yet been committed
        - peer syncing
        - real-time get (retrieve documents by unique id), which is always up to date because
          it checks the tlog first and does not require opening a new searcher to see changes
• Near Real Time (NRT) indexing
    - Soft commits make updates visible
    - Hard commits make updates durable
• Durability
    - Updates to Solr may be in one of several states: buffered in memory; flushed, but not
      committed or viewable; soft committed (flushed and viewable); committed (durable)
    - The transaction log ensures data is not lost in any of these states if Solr crashes.
• Recovery
    - Solr uses the transaction log for recovery: on startup, Solr checks whether the tlog is in a
      committed state; if not, updates since the last commit are applied
• Optimistic locking
    - Solr maintains a document version (_version_ field); updates can now specify _version_;
      an update that specifies the wrong version will fail
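
The optimistic locking bullet can be sketched with a toy in-memory store. The class `ToyStore`, the exception `VersionConflict`, and the counter-based versions are all hypothetical; Solr's real rules for `_version_` cover more cases than this simplified check.

```python
import itertools

class VersionConflict(Exception):
    """Raised when an update names a version that is no longer current."""

class ToyStore:
    def __init__(self):
        self._docs = {}
        self._clock = itertools.count(1)  # stand-in for version generation

    def get(self, doc_id):
        return self._docs.get(doc_id)

    def update(self, doc, expected_version=None):
        current = self._docs.get(doc["id"])
        if expected_version is not None:
            current_version = current["_version_"] if current else None
            if current_version != expected_version:
                raise VersionConflict(
                    "expected %r, found %r" % (expected_version, current_version))
        # Every accepted write gets a fresh, larger version.
        stored = dict(doc, _version_=next(self._clock))
        self._docs[doc["id"]] = stored
        return stored["_version_"]

store = ToyStore()
v1 = store.update({"id": "1", "title": "a"})
v2 = store.update({"id": "1", "title": "b"}, expected_version=v1)
```

A client that still holds `v1` would now be rejected, so it has to re-read the document and retry rather than silently overwrite the newer title.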


SolrCloud details

     • “Leaders” and “replicas”
         - Leaders are automatically elected
     • A leader is just a replica with some coordination
       responsibilities for the associated replicas
     • If a leader goes down, one of the associated replicas is
       elected as the new leader
     • New nodes are automatically assigned a shard and
       role, and replicate/recover as needed
     • SolrJ’s CloudSolrServer
     • Replication in Solr 4
         - Used for new and recovering replicas
         - Or for traditional master/slave configuration

Solr as NoSQL

• Characteristics
    - Non-traditional data stores
    - Not designed for SQL type queries
    - Distributed fault tolerant architecture
    - Document oriented, data format agnostic (JSON, XML, CSV,
      binary)
• Update durability via transaction log
• Real-time /get fetches latest version w/o hard commit
• Versioning and optimistic locking
    - w/ Real Time GET, allows read/write/update w/o conflicts
• Atomic updates
    - Can add/remove/change and increment a field in existing doc
      w/o re-indexing
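
A minimal sketch of an atomic update body, assuming the JSON update format in which each changed field maps to a modifier such as `set`, `add`, or `inc`. The document id, field names, and the helper `atomic_update` are made up for illustration.

```python
import json

def atomic_update(doc_id, **modifiers):
    """Build one atomic-update document: each field maps to {operation: value}."""
    doc = {"id": doc_id}
    for field, (operation, value) in modifiers.items():
        doc[field] = {operation: value}
    return doc

# Change one field, increment another, and append to a multi-valued one.
payload = json.dumps([
    atomic_update("book1",
                  price=("set", 12.5),
                  copies=("inc", 1),
                  tags=("add", "sale")),
])
```

Only the id and the changed fields travel over the wire; the client does not need to resend the rest of the document.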


Distributed Key / Value Pair Database

     • Real-time Get combined with SolrCloud makes a very
       powerful key/value pair database
         - Durable (tlog)
         - Isolated (optimistic locking)
         - Redundant (SolrCloud replicas)
         - Distributed & scalable (billions of keys, SolrCloud sharding)
         - Efficient multi-tenant (SolrCloud document routing, Solr 4.1)
         - Fast (millisecond response time, Pulsing codec)
         - Real-time (tlog)
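
How a transaction log gives both durability and real-time reads can be sketched with a toy key/value store. The class and method names are hypothetical, and lists and dicts stand in for the on-disk log and index; this is a conceptual model, not Solr's implementation.

```python
class ToyKeyValueStore:
    def __init__(self):
        self.tlog = []     # append-only record of uncommitted updates
        self.pending = {}  # latest uncommitted value per key
        self.index = {}    # committed state

    def put(self, key, value):
        # Log first, so the update survives a crash before the next commit.
        self.tlog.append((key, value))
        self.pending[key] = value

    def realtime_get(self, key):
        # Check uncommitted updates first, then fall back to the index.
        if key in self.pending:
            return self.pending[key]
        return self.index.get(key)

    def commit(self):
        self.index.update(self.pending)
        self.pending.clear()
        self.tlog.clear()

    def recover(self):
        """Rebuild uncommitted state after a crash by replaying the log."""
        self.pending = {}
        for key, value in self.tlog:
            self.pending[key] = value

kv = ToyKeyValueStore()
kv.put("user:1", "alice")
latest = kv.realtime_get("user:1")  # visible before any commit
```

The read path never waits for a commit, which is the property the "Real-time (tlog)" bullet refers to.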




Routing

     • Allows you to route documents and queries to a subset
       of shards
     • Provides efficient multi-tenancy
     • Indexing:
         - A shard key can be prepended to the unique document id:
           shard_key!unique_id
         - Documents with the same shard_key will reside on the same
           shard.
     • Querying: shard.keys=shard_key1!...
         - Much more efficient than searching the entire collection.
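
The effect of the `shard_key!unique_id` convention can be sketched as follows. The shard count and the use of CRC32 modulo the shard count are stand-ins chosen for brevity; Solr's actual router uses a different hash and assigns hash ranges to shards.

```python
import zlib

NUM_SHARDS = 4  # hypothetical cluster size

def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Pick a shard from the part of the id before '!', if there is one."""
    # Without a '!' the whole id is used as the routing key.
    route_key = doc_id.split("!", 1)[0]
    # CRC32 is a stand-in hash for illustration only.
    return zlib.crc32(route_key.encode("utf-8")) % num_shards

a = shard_for("tenantA!doc1")
b = shard_for("tenantA!doc2")
```

Because both ids share the `tenantA` prefix they land on the same shard, so a query restricted to that key only has to visit one shard.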




Looking ahead

• Automatic shard splitting
• Query parsing: rich query tree control via JSON/XML

• “Schemaless”
    - Marketing term meaning convention over configuration for fields


• More programmatic control over system

• Continually improving performance, scalability, and
  robustness


• Want to learn more?

     • Join us in San Diego April 29 – May 2, 2013

     • http://lucenerevolution.org/

     • http://lucenerevolution.org/2013/agenda




Resources

• Lucene/Solr
    - http://lucene.apache.org


• Me
    - @gsingers, grant@lucidworks.com
    - http://www.manning.com/ingersoll


• Company
    - http://www.lucidworks.com
    - http://www.searchhub.org
    - Products, Support, Training on and around Lucene and Solr





What's new in Lucene and Solr 4.x


Editor's Notes

  • #3: Search abuse. Can discuss how I started out just doing free text, but then a curious thing happened: I started to see people using the engine for things like key/value, denormalized DBs, browsing engines, plagiarism detection, teaching languages, record linkage and much, much more
  • #4: Search has added more DB features over the years. TED: We need to introduce the idea of *REVOLUTION* somewhere in here.
  • #13: Okapi BM25 & DFR (divergence from randomness)
  • #18: Power users are often more likely to recover. Tools for recovery: auto-suggest, related searches, spelling suggestions
  • #22: Characteristics. Conflicts from other clients
  • #25: Power users are often more likely to recover. Tools for recovery: auto-suggest, related searches, spelling suggestions
  • #26: Thanks, LinkedIn for sponsoring!