What's New in Solr?


        code4lib 2011 preconference
               Bloomington, IN
presented by Erik Hatcher of Lucid Imagination
about me
spoken at several code4lib conferences

   Keynoted Athens '07 along with the pioneering Solr preconference,

   Providence '09, "Rising Sun"

   pre-conferenced Asheville '10, "Solr Black Belt"

co-authored "Lucene in Action", first edition; ghost/toast on second edition

Lucene and Solr committer.

library world claims to fame are founding and naming Blacklight, original developer on
Collex and the Rossetti Archive search

now at Lucid Imagination, dedicated to Lucene/Solr support/services/training/etc
abstract
The library world is fired up about Solr. Practically every next-gen catalog is
using it (via Blacklight, VuFind, or other technologies). Solr has continued
improving in some dramatic ways, including geospatial support, field
collapsing/grouping, extended dismax query parsing, pivot/grid/matrix/tree
faceting, autosuggest, and more. This session will cover all of these new
features, showcasing live examples of them all, including anything new that is
implemented prior to the conference.
LIA2 - Lucene in Action
Published: July 2010 - http://www.manning.com/lucene/
New in this second edition:
   Performing hot backups
   Using numeric fields
   Tuning for indexing or searching speed
   Boosting matches with payloads
   Creating reusable analyzers
   Adding concurrency with threads
   Four new case studies, and more
Version Number
Which one ya talking 'bout, Willis?

  3.1? 4.0?? TRUNK??

playing with fire

  index format changes to be expected

     reindexing recommended/required

Solr/Lucene merged development codebases

  releases should occur lock-step moving forward
dependencies

November 2009: Solr 1.4 (Lucene 2.9.1)

June 2010: Solr 1.4.1 (Lucene 2.9.3)

Spring 2011(?): Solr 3.1 (Lucene 3.1)

TRUNK: Solr 4.x (Lucene TRUNK)
lucene
per-segment field cache, etc

Unicode and analysis improvements throughout

Analysis "attributes"

AutomatonQuery: RegexpQuery, WildcardQuery

flexible indexing

and so much more!
README

Reindex!

Upgrade SolrJ libraries too (javabin format
changed)

Read Lucene and Solr's CHANGES.txt files for all
the details
Analysis

UAX, using ICU

CollationKey

PatternReplaceCharFilter

KeywordMarkerFilterFactory,
StemmerOverrideFilterFactory
Standard tokenization

ClassicTokenizer: old StandardTokenizer

StandardTokenizer: now uses Unicode text
segmentation specified by UAX#29

UAX29URLEmailTokenizer

maxTokenLength: default=255
PathHierarchyTokenizer


delimiter: default=/

replace: default=<delimiter>
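
A minimal Python sketch of how the two parameters above interact, assuming the tokenizer emits one token per ancestor path, shortest first. The function name `path_hierarchy_tokens` is hypothetical and this is an illustration, not the Lucene implementation:

```python
def path_hierarchy_tokens(path, delimiter="/", replace=None):
    """Emit one token per ancestor path, shortest first (illustrative only)."""
    if replace is None:
        replace = delimiter  # mirrors "replace: default=<delimiter>"
    tokens = []
    current = ""
    for i, part in enumerate(path.split(delimiter)):
        # The first part is empty when the path starts with the delimiter.
        current = part if i == 0 else current + replace + part
        if current:
            tokens.append(current)
    return tokens


print(path_hierarchy_tokens("/foo/bar"))
```

Indexing every ancestor path is what lets a filter on a parent path match all of its descendants.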

"/foo/bar" => [/foo] [/foo/bar]
CollationKeyFilter
A filter that lets one specify:

   A system collator associated with a locale, or

   A collator based on custom rules

This can be used for changing sort order for non-English languages as well as
to modify the collation sequence for certain languages. You must use the same
CollationKeyFilter at both index-time and query-time for correct results. Also,
the JVM vendor and version (including patch version) of the slave should be exactly
the same as on the master (or indexer) for consistent results.

http://wiki.apache.org/solr/UnicodeCollation

see also: ICUCollationKeyFilter
ICU
International Components for Unicode

ICUFoldingFilter

ICUNormalizer2Filter

  name=nfc|nfkc|nfkc_cf

  mode=compose|decompose

  filter
ICUFoldingFilter
Accent removal, case folding, canonical duplicates folding, dashes
folding, diacritic removal (including stroke, hook, descender), Greek letterforms
folding, Han Radical folding, Hebrew Alternates folding, Jamo folding,
Letterforms folding, Math symbol folding, Multigraph Expansions: All, Native
digit folding, No-break folding, Overline folding, Positional forms folding, Small
forms folding, Space folding, Spacing Accents folding, Subscript folding,
Superscript folding, Suzhou Numeral folding, Symbol folding, Underline folding,
Vertical forms folding, Width folding

Additionally, Default Ignorables are removed, and text is normalized to NFKC.

 All foldings, case folding, and normalization mappings are applied recursively
to ensure a fully folded and normalized result.
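
As a rough illustration of what folding does, here is a Python sketch that uses only the standard `unicodedata` module. It approximates a few of the foldings listed above (NFKC normalization, accent removal, case folding) and is not ICU; foldings such as stroke or Han radical folding are not covered. The name `rough_fold` is hypothetical:

```python
import unicodedata


def rough_fold(text):
    """Approximate a small subset of ICU-style folding (illustrative only)."""
    # Compatibility normalization: width folding, ligature expansion, etc.
    normalized = unicodedata.normalize("NFKC", text)
    # Decompose so accents become separate combining marks, then drop them.
    decomposed = unicodedata.normalize("NFD", normalized)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose and apply case folding.
    return unicodedata.normalize("NFKC", stripped).casefold()


for sample in ["Café", "\uff33\uff4f\uff4c\uff52", "\ufb01le"]:
    print(repr(sample), "->", repr(rough_fold(sample)))
```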
ICUTransformFilter
id: specific transliterator identifier from ICU's
Transliterator#getAvailableIDs()(required)

direction=forward|reverse

Examples:

  Traditional-Simplified:         =>

  Cyrillic-Latin: Российская Федерация =>
  Rossijskaâ Federaciâ
Tom Burton-West's
latest

ICU

shingles

query parser

ABC -> [A] [B] [C] or [AB] [BC]...
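
The two tokenizations on this slide can be sketched in Python as follows. This is a character-level illustration only; the helper names are hypothetical and no Lucene filter is involved:

```python
def unigrams(text):
    """One token per character: ABC -> [A] [B] [C]."""
    return list(text)


def bigrams(text):
    """Overlapping two-character tokens: ABC -> [AB] [BC]."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


print(unigrams("ABC"), bigrams("ABC"))
```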
highlighter


deprecated old config, now config as standard
search component

FastVectorHighlighter
FastVectorHighlighter

if termVectors="true", termPositions="true", and
termOffsets="true"

and hl.useFastVectorHighlighter=true

  hl.fragListBuilder

  hl.fragmentsBuilder
spatial
JTeam's plugin: packaged for easy deployment

Solr trunk capabilities

many distance functions

What's missing?

  geo faceting? scoring by distance? distance
  pseudo-field?

All units in kilometers, unless otherwise specified
Spatial field types

Point: n-dimensional, must specify dimension
(default=2), represented by N subfields internally

LatLon: latitude,longitude, represented by two
subfields internally, single valued only

GeoHash: single string representation of lat/lon
Spatial query parsers
geofilt: exact filtering

bbox: uses (trie) range queries

Parameters:

  sfield: spatial field

  pt: reference point

  d: distance
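
A hedged Python sketch of assembling the request parameters for either parser. The field name `store` and the reference point are borrowed from the sort example later in this deck; the 5 km distance and the helper name are assumptions for illustration:

```python
from urllib.parse import urlencode


def spatial_filter_params(parser, sfield, lat, lon, d_km, q="*:*"):
    """Build request parameters for a {!geofilt} or {!bbox} filter query."""
    if parser not in ("geofilt", "bbox"):
        raise ValueError("parser must be 'geofilt' or 'bbox'")
    return {
        "q": q,
        "fq": "{!%s}" % parser,      # the parser reads sfield, pt, and d
        "sfield": sfield,            # spatial field
        "pt": "%s,%s" % (lat, lon),  # reference point as lat,lon
        "d": str(d_km),              # distance, in kilometers
    }


params = spatial_filter_params("geofilt", "store", 39.194564, -86.432947, 5)
print("/select?" + urlencode(params))
```

Swapping `geofilt` for `bbox` trades exact filtering for the cheaper range-query approach mentioned above.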
field collapsing/grouping
backwards compatibility mode?

http://wiki.apache.org/solr/FieldCollapsing

group=true

group.field / group.func / group.query

rows / start: for groups, not documents

group.limit: number of results per group

group.offset: offset into doclist of each group

sort: how to sort groups, by top document in each group

group.sort: how to sort docs within each group

group.format: grouped | simple

group.main=true|false:

faceting works as normal

not distributed savvy yet
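
To make the paging semantics concrete, here is a small in-memory Python sketch, assuming the documents are already in result order. The function, its defaults, and the simplified output shape are assumptions for illustration, not Solr's response format:

```python
def group_results(docs, field, start=0, rows=10, group_offset=0, group_limit=1):
    """Group already-sorted docs by a field value (illustrative only)."""
    groups = {}
    for doc in docs:
        # Insertion order means groups are ordered by their top document.
        groups.setdefault(doc[field], []).append(doc)
    # start/rows page through groups, not documents.
    page = list(groups.items())[start:start + rows]
    return [
        {
            "groupValue": value,
            "numFound": len(members),
            # group.offset/group.limit page within each group.
            "docs": members[group_offset:group_offset + group_limit],
        }
        for value, members in page
    ]


docs = [
    {"id": 1, "cat": "a"},
    {"id": 2, "cat": "b"},
    {"id": 3, "cat": "a"},
    {"id": 4, "cat": "c"},
]
for group in group_results(docs, "cat", rows=2, group_limit=2):
    print(group)
```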
query parsing


TextField: autoGeneratePhraseQueries="true"

  if single string analyzes to multiple tokens
{!raw|term|field f=$f}...
Recall why we needed {!raw} from last year

<fieldType = .../> - use one string, one numeric, (and one text?)

<field name="..."/>

table for numeric and for string (and text?):

    {!raw f=$f} | TermQuery(...)

    {!term f=$f} | ...

    {!field f=$f} | ...

Which to use when? {!raw} works for strings just fine, but best to migrate to the generally
safer/wiser {!term} for future-proofing.
{!term f=field}


fq={!term f=weight}1.5
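
A tiny Python sketch of building such a filter on the client side. The helper name is hypothetical; the point is that the value after the closing brace is passed through as-is rather than being run through query syntax parsing:

```python
from urllib.parse import urlencode


def term_filter(field, value):
    """Build a {!term} filter query string for a single field value."""
    return "{!term f=%s}%s" % (field, value)


fq = term_filter("weight", "1.5")
print(fq)
print(urlencode({"fq": fq}))
```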
dismax

q.op or schema.xml's <solrQueryParser
defaultOperator="[AND|OR]"/> defaults mm to 0%
(OR) or 100% (AND)

#code4lib: issues with non-analyzed fields in qf
edismax
Supports full lucene query syntax in the absence of syntax errors


supports "and"/"or" to mean "AND"/"OR" in lucene syntax mode


When there are syntax errors, improved smart partial escaping of special characters is done to prevent
them... in this mode, fielded queries, +/-, and phrase queries are still supported.


Improved proximity boosting via word bigrams... this prevents the problem of needing 100% of the words in
the document to get any boost, as well as having all of the words in a single field.


advanced stopword handling... stopwords are not required in the mandatory part of the query but are still
used (if indexed) in the proximity boosting part. If a query consists of all stopwords (e.g. to be or not to be)
then all will be required.


Supports the "boost" parameter... like the dismax bf param, but multiplies the function query instead of
adding it in


Supports pure negative nested queries... so a query like +foo (-foo) will match all documents
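
The difference between bf and boost can be sketched with plain arithmetic. This Python illustration ignores query normalization and any weights, so the numbers are not real Solr scores; the function names are hypothetical:

```python
def score_with_bf(relevance, function_value):
    """dismax bf: the function value is added to the relevance score."""
    return relevance + function_value


def score_with_boost(relevance, function_value):
    """edismax boost: the function value multiplies the relevance score."""
    return relevance * function_value


# Same function value applied to a strong and a weak match.
for relevance in (2.0, 0.5):
    print(relevance, score_with_bf(relevance, 1.5), score_with_boost(relevance, 1.5))
```

With addition, a large function value can swamp a weak relevance score; with multiplication, the boost stays proportional to how well the document matched.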
function queries


termfreq, tf, docfreq, idf, norm, maxdoc, numdocs

{!func}termfreq(text,ipod)

standard java.lang.Math functions
faceting
per-segment, single-valued fields:

   facet.method=fcs (field cache per segment)

   facet.field={!threads=-1}field_name

       threads=0: direct execution

       threads=-1: thread per segment

   speeds up single and multivalued method=fc, especially for deep paging with
   facet.offset

date faceting improvements, generalized for numeric ranges too

can now exclude main query:

   q={!tag=main}the+query&facet.field={!ex=main}category
pivot/grid/matrix/tree faceting


is this also "hierarchical faceting"? it depends!
pivot faceting output
/select?q=*:*&rows=0&facet=on
&facet.pivot=cat,popularity,inStock
&facet.pivot=popularity,cat
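
The nested shape of a pivot result can be sketched in Python by counting in memory. This is an illustration of the idea (each value of the first field carries counts for the next field beneath it), with hypothetical sample documents, and ordering here is simply by descending count:

```python
def pivot_counts(docs, fields):
    """Nested facet counts for a list of fields (illustrative only)."""
    if not fields:
        return []
    field, rest = fields[0], fields[1:]
    buckets = {}
    for doc in docs:
        if field in doc:
            buckets.setdefault(doc[field], []).append(doc)
    result = []
    # sorted() is stable, so equal counts keep first-seen order.
    for value, members in sorted(buckets.items(), key=lambda kv: -len(kv[1])):
        entry = {"field": field, "value": value, "count": len(members)}
        if rest:
            entry["pivot"] = pivot_counts(members, rest)
        result.append(entry)
    return result


docs = [
    {"cat": "electronics", "inStock": True},
    {"cat": "electronics", "inStock": False},
    {"cat": "book", "inStock": True},
]
for entry in pivot_counts(docs, ["cat", "inStock"]):
    print(entry)
```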
spell checking


DirectSolrSpellChecker

  no external index needed, uses automaton on
  main index
spellcheck config
solrconfig.xml
  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">

    <str name="queryAnalyzerFieldType">textgen</str>

    <!-- a spellchecker that uses no auxiliary index -->
     <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spell</str>
      <str name="classname">solr.DirectSolrSpellChecker</str>
      <str name="minPrefix">1</str>
    </lst>
  </searchComponent>
spellcheck handler

solrconfig.xml
  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <!-- default values for query parameters -->
     <lst name="defaults">
       <str name="echoParams">explicit</str>
       <str name="spellcheck">true</str>
       <str name="spellcheck.collate">true</str>
     </lst>

    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
  </requestHandler>
spellcheck response
http://localhost:8983/solr/select?q=ipud%20bluck&wt=ruby&indent=on
                        {
                            'responseHeader'=>{
                               'status'=>0,
                               'QTime'=>10,
                               'params'=>{
                                 'indent'=>'on',
                                 'wt'=>'ruby',
                                 'q'=>'ipud bluck'}},
                            'response'=>{'numFound'=>0,'start'=>0,'docs'=>[]
                            },
                            'spellcheck'=>{
                               'suggestions'=>[
                                 'ipud',{
                                   'numFound'=>1,
                                   'startOffset'=>0,
                                   'endOffset'=>4,
                                   'suggestion'=>['ipod']},
                                 'bluck',{
                                   'numFound'=>1,
                                   'startOffset'=>5,
                                   'endOffset'=>10,
                                   'suggestion'=>['black']},
                                 'collation','ipod black']}}
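
The 'suggestions' value above is a flat list that alternates names and values. A minimal Python sketch of unpacking it, with the response transcribed by hand into Python literals (the helper name is hypothetical):

```python
def pairs_to_dict(flat):
    """Turn [name1, value1, name2, value2, ...] into a dict."""
    # Note: repeated names (for example several collations) would collide.
    return dict(zip(flat[0::2], flat[1::2]))


suggestions = [
    "ipud", {"numFound": 1, "startOffset": 0, "endOffset": 4,
             "suggestion": ["ipod"]},
    "bluck", {"numFound": 1, "startOffset": 5, "endOffset": 10,
              "suggestion": ["black"]},
    "collation", "ipod black",
]

parsed = pairs_to_dict(suggestions)
# Keep the first suggestion for each misspelled word.
corrections = {
    word: info["suggestion"][0]
    for word, info in parsed.items()
    if isinstance(info, dict)
}
print(corrections, "->", parsed["collation"])
```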
autosuggest

new "spellcheck" component, builds TST

collates query

can check if collated suggestions yield results,
optionally, providing hit count
suggest config
solrconfig.xml
  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">

    <str name="queryAnalyzerFieldType">textgen</str>


    <lst name="spellchecker">
      <str name="name">suggest</str>
      <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
      <str name="lookupImpl">
        org.apache.solr.spelling.suggest.jaspell.JaspellLookup
      </str>
      <str name="field">suggest</str>
      <str name="buildOnCommit">true</str>
    </lst>
  </searchComponent>

schema.xml
   <field name="suggest" type="textgen" indexed="true" stored="false"/>

   <copyField source="name" dest="suggest"/>
suggest handler
solrconfig.xml
  <requestHandler class="solr.SearchHandler" name="/suggest">
    <lst name="defaults">
      <str name="spellcheck">true</str>
      <str name="spellcheck.dictionary">suggest</str>
      <str name="spellcheck.collate">true</str>
      <str name="spellcheck.count">10</str>
      <str name="rows">0</str>
      <str name="spellcheck.maxCollationTries">20</str>
      <str name="spellcheck.maxCollations">10</str>
      <str name="spellcheck.collateExtendedResults">true</str>
    </lst>
    <arr name="components">
      <str>query</str> <!-- to allow suggestion hit counts to be returned -->
      <str>spellcheck</str>
    </arr>
  </requestHandler>
suggest response
http://localhost:8983/solr/suggest?q=ip&wt=ruby&indent=on
         {
             'responseHeader'=>{
                'status'=>0,
                'QTime'=>2},
             'response'=>{'numFound'=>0,'start'=>0,'docs'=>[]
             },
             'spellcheck'=>{
                'suggestions'=>[
                  'ip',{
                    'numFound'=>1,
                    'startOffset'=>0,
                    'endOffset'=>2,
                    'suggestion'=>['ipod']},
                  'collation',[
                    'collationQuery','ipod',
                    'hits',3,
                    'misspellingsAndCorrections',[
                      'ip','ipod']]]}}
sort

by function

  &q=*:*&sfield=store&pt=39.194564,-86.432947&
  sort=geodist() asc

but still can't get value of function back

  unless you force it to be the score somehow
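
Until the function value can be returned, one workaround is to recompute the distance on the client. The Python sketch below assumes a great-circle (haversine) distance on a sphere of radius 6371 km; the store coordinates and helper names are hypothetical, and results may differ slightly from what Solr computes:

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def sort_by_distance(docs, pt, sfield="store"):
    """Sort docs by distance from pt, nearest first."""
    return sorted(docs, key=lambda d: haversine_km(pt[0], pt[1], *d[sfield]))


pt = (39.194564, -86.432947)
docs = [
    {"id": "far", "store": (41.0, -87.0)},
    {"id": "near", "store": (39.2, -86.4)},
]
for doc in sort_by_distance(docs, pt):
    print(doc["id"], round(haversine_km(pt[0], pt[1], *doc["store"]), 1))
```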
clustering component


now works out-of-the-box; all Apache license
compatible

supports distributed search
debug=true


debug=true|all|timing|query|results

debug=results&debug.explain.structured=true
structured explain
http://localhost:8983/solr/select?q=title:solr
&debug.explain.structured=true&debug=results
&wt=ruby&indent=on
     'debug'=>{
       'explain'=>{
         'doc1'=>{
           'match'=>true,
           'value'=>0.076713204,
           'description'=>'fieldWeight(title:solr in 0), product of:',
           'details'=>[{
                'match'=>true,
                'value'=>1.0,
                'description'=>'tf(termFreq(title:solr)=1)'},
             {
                'match'=>true,
                'value'=>0.30685282,
                'description'=>'idf(docFreq=1, maxDocs=1)'},
             {
                'match'=>true,
                'value'=>0.25,
                'description'=>'fieldNorm(field=title, doc=0)'}]}}}}
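
A structured explain is easy to check programmatically. This Python sketch transcribes the response above by hand and confirms that the top-level value is the product of its details. The idf check assumes Lucene's classic default similarity, idf = 1 + ln(maxDocs / (docFreq + 1)); a different similarity would give different numbers:

```python
import math

explain = {
    "match": True,
    "value": 0.076713204,
    "description": "fieldWeight(title:solr in 0), product of:",
    "details": [
        {"match": True, "value": 1.0,
         "description": "tf(termFreq(title:solr)=1)"},
        {"match": True, "value": 0.30685282,
         "description": "idf(docFreq=1, maxDocs=1)"},
        {"match": True, "value": 0.25,
         "description": "fieldNorm(field=title, doc=0)"},
    ],
}

# "product of:" means the parent value is tf * idf * fieldNorm.
product = math.prod(detail["value"] for detail in explain["details"])
print(product, explain["value"])

# Assumed classic idf formula with docFreq=1 and maxDocs=1.
idf = 1 + math.log(1 / (1 + 1))
print(idf)
```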
SolrCloud

shared/central config and core/shard management
via ZooKeeper,

built-in load balancing, and infrastructure for future
SolrCloud work.
/update/json
solrconfig.xml
  <requestHandler name="/update/json" class="solr.JsonUpdateRequestHandler"/>




     curl 'http://localhost:8983/solr/update/json?commit=true' \
          -H 'Content-type:application/json' -d '
     {
       "add": {
         "doc": {
           "id" : "MyTestDocument",
           "title" : "This is just a test"
         }
       }
     }'
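
The same request can be prepared from Python with only the standard library. This is a sketch that builds the request without sending it; the helper name is hypothetical, and the URL assumes the local example server used throughout this deck:

```python
import json
from urllib.request import Request


def build_add_request(doc, base_url="http://localhost:8983/solr"):
    """Build (but do not send) a JSON add request with commit=true."""
    body = json.dumps({"add": {"doc": doc}}).encode("utf-8")
    return Request(
        base_url + "/update/json?commit=true",
        data=body,
        headers={"Content-type": "application/json"},
        method="POST",
    )


request = build_add_request(
    {"id": "MyTestDocument", "title": "This is just a test"}
)
print(request.get_method(), request.full_url)
# To send it to a running server: urllib.request.urlopen(request)
```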
wt=csv

Writes only docs (no response header or response
extras) in CSV format

Roundtrippable with /update/csv

  provided all fields are stored
UIMA
Unstructured Information Management
Architecture

 http://uima.apache.org/

New update processor chain, augmenting
incoming documents from a UIMA annotator
pipeline

 http://wiki.apache.org/solr/SolrUIMA
(solr|lucene)-dev


ant [idea|eclipse]

go!

http://wiki.apache.org/solr/HowToContribute
works in progress

some interesting open issues (with patches):

  PayloadTermQuery

  XMLQueryParser plugin

  join
{!join from=$f to=$t}


insert <what Yonik said>

  https://issues.apache.org/jira/browse/SOLR-2272
Lucid (imagination)
What's Lucid done for you lately -

  Yonik, Mark, Grant, Hoss: Lucene and Solr performance,
  faceting, grouping, join query, spatial, Mahout, ORP, PMC,
  etc, etc, etc

  Other technical staff involved in mailing list assistance, bug
  reporting, contributing patches (hi Lance, Erick, Jay, Tom,
  Grijesh, Tomas....)

  extended dismax, join, faceting performance improvements

  LucidWorks Enterprise
Hoss Simplicity

http://www.lucidimagination.com/blog/2011/01/21/solr-powered-isfdb-part1/

http://www.lucidimagination.com/blog/2011/01/28/solr-powered-isfdb-part-2/
LucidWorks Enterprise
      "lucid" query parser

      click boosting

      tunable norms, per-field

      role filtering

      administrative UI

      REST API

      Data sources, crawlers, and scheduling

      Alerts

http://www.lucidimagination.com/enterprise-search-solutions/lucidworks
Community Questions


fire away!
resources


duh!: #code4lib

lucene.apache.org/solr

search.lucidimagination.com/?q=<your query>
Q&A: faceting


why is paging through facets the way it is?

  short-circuits on enum
Community:
- The state of Extended DisMax, and what Lucene features
remain incompatible with it.

- Any developments on faceting (I've implemented the
standard workaround to the "unknown facet list size"
problem...  but I'd still love to be able to know exactly how
long the lists are)

- Hierarchical documents in Solr -- I haven't followed the
conversations closely, but I gather that this topic is gaining
some momentum in the Solr community.
contact info
erik.hatcher @ lucidimagination . com

http://www.lucidimagination.com

  webinars, documentation

  LucidFind: search.lucidimagination.com

    search mailing list posts, wiki pages, web
    sites, our blog, etc for latest Lucene/Solr
    assistance
code4lib 2011 preconference: What's New in Solr (since 1.4.1)
code4lib 2011 preconference: What's New in Solr (since 1.4.1)

  • 1. What's New in Solr? code4lib 2011 preconference Bloomington, IN presented by Erik Hatcher of Lucid Imagination
  • 2. about me spoken at several code4lib conferences Keynoted Athens '07 along with the pioneering Solr preconference, Providence '09, "Rising Sun" pre-conferenced Asheville '10, "Solr Black Belt" co-authored "Lucene in Action", first edition; ghost/toast on second edition Lucene and Solr committer. library world claims to fame are founding and naming Blacklight, original developer on Collex and the Rossetti Archive search now at Lucid Imagination, dedicated to Lucene/Solr support/services/training/etc
  • 3. abstract The library world is fired up about Solr. Practically every next-gen catalog is using it (via Blacklight, VuFind, or other technologies). Solr has continued improving in some dramatic ways, including geospatial support, field collapsing/grouping, extended dismax query parsing, pivot/ grid/matrix/tree faceting, autosuggest, and more. This session will cover all of these new features, showcasing live examples of them all, including anything new that is implemented prior to the conference.
  • 4. LIA2 - Lucene in Action Published: July 2010 - http://www.manning.com/lucene/ New in this second edition: Performing hot backups Using numeric fields Tuning for indexing or searching speed Boosting matches with payloads Creating reusable analyzers Adding concurrency with threads Four new case studies, and more
  • 5. Version Number Which one ya talking 'bout, Willis? 3.1? 4.0?? TRUNK?? playing with fire index format changes to be expected reindexing recommended/required Solr/Lucene merged development codebases releases should occur lock-step moving forward
  • 6. dependencies November 2009: Solr 1.4 (Lucene 2.9.1) June 2010: Solr 1.4.1 (Lucene 2.9.3) Spring 2011(?): Solr 3.1 (Lucene 3.1) TRUNK: Solr 4.x (Lucene TRUNK)
  • 7. lucene per-segment field cache, etc Unicode and analysis improvements throughout Analysis "attributes" AutomatonQuery: RegexpQuery, WildcardQuery flexible indexing and so much more!
  • 8. README Reindex! Upgrade SolrJ libraries too (javabin format changed) Read Lucene and Solr's CHANGES.txt files for all the details
  • 10. Standard tokenization ClassicTokenizer: old StandardTokenizer StandardTokenizer: now uses Unicode text segmentation specified by UAX#29 UAX29URLEmailTokenizer maxTokenLength: default=255
  • 12. CollationKeyFilter A filter that lets one specify: A system collator associated with a locale, or A collator based on custom rules This can be used for changing sort order for non-english languages as well as to modify the collation sequence for certain languages. You must use the same CollationKeyFilter at both index-time and query-time for correct results. Also, the JVM vendor, version (including patch version) of the slave should be exactly same as the master (or indexer) for consistent results. http://wiki.apache.org/solr/UnicodeCollation see also: ICUCollationKeyFilter
  • 13. ICU International Components for Unicode ICUFoldingFilter ICUNormalizer2Filter name=nfc|nfkc|nfkc_cf mode=compose|decompose filter
  • 14. ICUFoldingFilter Accent removal, case folding,canonical duplicates folding,dashes folding,diacritic removal (including stroke, hook, descender), Greek letterforms folding, Han Radical folding, Hebrew Alternates folding, Jamo folding, Letterforms folding, Math symbol folding, Multigraph Expansions: All, Native digit folding, No-break folding, Overline folding, Positional forms folding, Small forms folding, Space folding, Spacing Accents folding, Subscript folding, Superscript folding, Suzhou Numeral folding, Symbol folding, Underline folding, Vertical forms folding, Width folding Additionally, Default Ignorables are removed, and text is normalized to NFKC. All foldings, case folding, and normalization mappings are applied recursively to ensure a fully folded and normalized result.
  • 15. ICUTransformFilter id: specific transliterator identifier from ICU's Transliterator#getAvailableIDs()(required) direction=forward|reverse Examples: Traditional-Simplified: => Cyrillic-Latin: Российская Федерация => Rossijskaâ Federaciâ
  • 17. highlighter deprecated old config, now config as standard search component FastVectorHighlighter
  • 18. FastVectorHighlighter if termVectors="true", termPositions="true", and termOffsets="true" and hl.useFastVectorHighlighter=true hl.fragListBuilder hl.fragmentsBuilder
  • 19. spatial JTeam's plugin: packaged for easy deployment Solr trunk capabilities many distance functions What's missing? geo faceting? scoring by distance? distance pseudo-field? All units in kilometers, unless otherwise specified
  • 20. Spatial field types Point: n-dimensional, must specify dimension (default=2), represented by N subfields internally LatLon: latitude,longitude, represented by two subfields internally, single valued only GeoHash: single string representation of lat/lon
  • 21. Spatial query parsers geofilt: exact filtering bbox: uses (trie) range queries Parameters: sfield: spatial field pt: reference point d: distance
  • 22. field collapsing/grouping backwards compatibility mode? sort: how to sort groups, by top document in each group http://wiki.apache.org/solr/ FieldCollapsing group.sort: how to sort docs within each group group=true group.format: grouped | simple group.field / group.func / group.query group.main=true|false: rows / start: for groups, not documents faceting works as normal group.limit: number of results per group not distributed savvy yet group.offset: offset into doclist of each group
  • 23. query parsing TextField: autoGeneratePhraseQueries="true" if single string analyzes to multiple tokens
  • 24. {!raw|term|field f=$f}... Recall why we needed {!raw} from last year <fieldType = .../> - use one string, one numeric, (and one text?) <field name="..."/> table for numeric and for string (and text?): {!raw f=$f} | TermQuery(...) {!term f=$f} | ... {!field f=$f} | ... Which to use when? {!raw} works for strings just fine, but best to migrate to the generally safer/wiser {!term} for future-proofing.
  • 26. dismax q.op or schema.xml's <solrQueryParser defaultOperator="[AND|OR]"/> defaults mm to 0% (OR) or 100% (AND) #code4lib: issues with non-analyzed fields in qf
  • 27. edismax Supports full lucene query syntax in the absence of syntax errors supports "and"/"or" to mean "AND"/"OR" in lucene syntax mode When there are syntax errors, improved smart partial escaping of special characters is done to prevent them... in this mode, fielded queries, +/-, and phrase queries are still supported. Improved proximity boosting via word bigrams... this prevents the problem of needing 100% of the words in the document to get any boost, as well as having all of the words in a single field. advanced stopword handling... stopwords are not required in the mandatory part of the query but are still used (if indexed) in the proximity boosting part. If a query consists of all stopwords (e.g. to be or not to be) then all will be required. Supports the "boost" parameter.. like the dismax bf param, but multiplies the function query instead of adding it in Supports pure negative nested queries... so a query like +foo (-foo) will match all documents
  • 28. function queries termfreq, tf, docfreq, idf, norm, maxdoc, numdocs {!func}termfreq(text,ipod) standard java.util.Math functions
  • 29. faceting per-segment, single-valued fields: facet.method=fcs (field cache per segment) facet.field={!threads=-1}field_name threads=0: direct execution threads=-1: thread per segment speeds up single and multivalued method=fc, especially for deep paging with facet.offset date faceting improvements, generalized for numeric ranges too can now exclude main query q={!tag=main}the+query&facet.field={!ex=main}category
  • 30. pivot/grid/matrix/tree faceting is this also "hierarchical faceting"? it depends!
  • 32. spell checking
      DirectSolrSpellChecker: no external index needed, uses automaton on main index
  • 33. spellcheck config
      solrconfig.xml
        <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
          <str name="queryAnalyzerFieldType">textgen</str>
          <!-- a spellchecker that uses no auxiliary index -->
          <lst name="spellchecker">
            <str name="name">default</str>
            <str name="field">spell</str>
            <str name="classname">solr.DirectSolrSpellChecker</str>
            <str name="minPrefix">1</str>
          </lst>
        </searchComponent>
  • 34. spellcheck handler
      solrconfig.xml
        <requestHandler name="standard" class="solr.SearchHandler" default="true">
          <!-- default values for query parameters -->
          <lst name="defaults">
            <str name="echoParams">explicit</str>
            <str name="spellcheck">true</str>
            <str name="spellcheck.collate">true</str>
          </lst>
          <arr name="last-components">
            <str>spellcheck</str>
          </arr>
        </requestHandler>
  • 35. spellcheck response
      http://localhost:8983/solr/select?q=ipud%20bluck&wt=ruby&indent=on
        {
          'responseHeader'=>{
            'status'=>0,
            'QTime'=>10,
            'params'=>{
              'indent'=>'on',
              'wt'=>'ruby',
              'q'=>'ipud bluck'}},
          'response'=>{'numFound'=>0,'start'=>0,'docs'=>[]
          },
          'spellcheck'=>{
            'suggestions'=>[
              'ipud',{
                'numFound'=>1,
                'startOffset'=>0,
                'endOffset'=>4,
                'suggestion'=>['ipod']},
              'bluck',{
                'numFound'=>1,
                'startOffset'=>5,
                'endOffset'=>10,
                'suggestion'=>['black']},
              'collation','ipod black']}}
  • 36. autosuggest
      new "spellcheck" component, builds TST
      collates query
      can optionally check whether collated suggestions yield results, providing hit count
  • 37. suggest config
      solrconfig.xml
        <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
          <str name="queryAnalyzerFieldType">textgen</str>
          <lst name="spellchecker">
            <str name="name">suggest</str>
            <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
            <str name="lookupImpl">org.apache.solr.spelling.suggest.jaspell.JaspellLookup</str>
            <str name="field">suggest</str>
            <str name="buildOnCommit">true</str>
          </lst>
        </searchComponent>
      schema.xml
        <field name="suggest" type="textgen" indexed="true" stored="false"/>
        <copyField source="name" dest="suggest"/>
  • 38. suggest handler
      solrconfig.xml
        <requestHandler class="solr.SearchHandler" name="/suggest">
          <lst name="defaults">
            <str name="spellcheck">true</str>
            <str name="spellcheck.dictionary">suggest</str>
            <str name="spellcheck.collate">true</str>
            <str name="spellcheck.count">10</str>
            <str name="rows">0</str>
            <str name="spellcheck.maxCollationTries">20</str>
            <str name="spellcheck.maxCollations">10</str>
            <str name="spellcheck.collateExtendedResults">true</str>
          </lst>
          <arr name="components">
            <str>query</str> <!-- to allow suggestion hit counts to be returned -->
            <str>spellcheck</str>
          </arr>
        </requestHandler>
  • 39. suggest response
      http://localhost:8983/solr/suggest?q=ip&wt=ruby&indent=on
        {
          'responseHeader'=>{
            'status'=>0,
            'QTime'=>2},
          'response'=>{'numFound'=>0,'start'=>0,'docs'=>[]
          },
          'spellcheck'=>{
            'suggestions'=>[
              'ip',{
                'numFound'=>1,
                'startOffset'=>0,
                'endOffset'=>2,
                'suggestion'=>['ipod']},
              'collation',[
                'collationQuery','ipod',
                'hits',3,
                'misspellingsAndCorrections',[
                  'ip','ipod']]]}}
  • 40. sort by function
      &q=*:*&sfield=store&pt=39.194564,-86.432947&sort=geodist() asc
      but still can't get the value of the function back unless you force it to be the score somehow
  • 41. clustering component
      now works out-of-the-box; all Apache license compatible
      supports distributed search
  • 43. structured explain
      http://localhost:8983/solr/select?q=title:solr&debug.explain.structured=true&debug=results&wt=ruby&indent=on
        'debug'=>{
          'explain'=>{
            'doc1'=>{
              'match'=>true,
              'value'=>0.076713204,
              'description'=>'fieldWeight(title:solr in 0), product of:',
              'details'=>[{
                  'match'=>true,
                  'value'=>1.0,
                  'description'=>'tf(termFreq(title:solr)=1)'},
                {
                  'match'=>true,
                  'value'=>0.30685282,
                  'description'=>'idf(docFreq=1, maxDocs=1)'},
                {
                  'match'=>true,
                  'value'=>0.25,
                  'description'=>'fieldNorm(field=title, doc=0)'}]}}}}
  • 44. SolrCloud
      shared/central config and core/shard management via ZooKeeper, built-in load balancing, and infrastructure for future SolrCloud work.
  • 45. /update/json
      solrconfig.xml
        <requestHandler name="/update/json" class="solr.JsonUpdateRequestHandler"/>
      curl 'http://localhost:8983/solr/update/json?commit=true' -H 'Content-type:application/json' -d '
        {
          "add": {
            "doc": {
              "id" : "MyTestDocument",
              "title" : "This is just a test"
            }
          }
        }'
  • 46. wt=csv
      Writes only docs (no response header or response extras) in CSV format
      Roundtrippable with /update/csv provided all fields are stored
  • 47. UIMA
      Unstructured Information Management Architecture
        http://uima.apache.org/
      New update processor chain, augmenting incoming documents from a UIMA annotator pipeline
        http://wiki.apache.org/solr/SolrUIMA
  • 49. works in progress some interesting open issues (with patches): PayloadTermQuery XMLQueryParser plugin join
  • 50. {!join from=$f to=$t}
      insert <what Yonik said>
      https://issues.apache.org/jira/browse/SOLR-2272
  • 51. Lucid (imagination)
      What's Lucid done for you lately?
        Yonik, Mark, Grant, Hoss: Lucene and Solr performance, faceting, grouping, join query, spatial, Mahout, ORP, PMC, etc, etc, etc
        Other technical staff involved in mailing list assistance, bug reporting, contributing patches (hi Lance, Erick, Jay, Tom, Grijesh, Tomas....)
        extended dismax, join, faceting performance improvements
        LucidWorks Enterprise
  • 53. LucidWorks Enterprise
      "lucid" query parser
      click boosting
      tunable norms, per-field
      role filtering
      REST API
      Data sources, crawlers, and scheduling
      Alerts
      administrative UI
      http://www.lucidimagination.com/enterprise-search-solutions/lucidworks
  • 56. Q&A: faceting why is paging through facets the way it is? short-circuits on enum
  • 57. Community:
      - The state of Extended DisMax, and what Lucene features remain incompatible with it.
      - Any developments on faceting (I've implemented the standard workaround to the "unknown facet list size" problem... but I'd still love to be able to know exactly how long the lists are)
      - Hierarchical documents in Solr -- I haven't followed the conversations closely, but I gather that this topic is gaining some momentum in the Solr community.
  • 58. contact info
      erik.hatcher @ lucidimagination . com
      http://www.lucidimagination.com
        webinars, documentation
      LucidFind: search.lucidimagination.com
        search mailing list posts, wiki pages, web sites, our blog, etc for latest Lucene/Solr assistance
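Several slides above show raw request strings (the tagged/excluded facet query on slide 29, the geodist() sort on slide 40, the structured explain request on slide 43, and the JSON add on slide 45). As a minimal sketch, not something from the talk, the same requests could be assembled from a client script roughly as follows. The helper names (`select_url`, `params_of`, and so on) are hypothetical, the base URL is assumed to be the localhost instance used in the slides, and `facet=true` is added on the assumption that faceting is not already enabled in the handler defaults. Nothing here contacts a server; it only builds URLs and a request body.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed base URL, matching the localhost examples on the slides.
SOLR = "http://localhost:8983/solr"


def select_url(params, handler="select"):
    """Build a request URL; urlencode escapes local params such as {!tag=main}."""
    return f"{SOLR}/{handler}?{urlencode(params)}"


def params_of(url):
    """Decode a URL's query string back into a dict, handy for inspection."""
    return parse_qs(urlsplit(url).query)


def facet_excluding_main_query(query, field):
    """Tag the main query and exclude it when counting facets (slide 29)."""
    return select_url({
        "q": "{!tag=main}" + query,
        "facet": "true",  # assumption: faceting is not on by default
        "facet.field": "{!ex=main}" + field,
    })


def sort_by_distance(lat, lon, sfield="store"):
    """Sort all documents by distance from a point (slide 40)."""
    return select_url({
        "q": "*:*",
        "sfield": sfield,
        "pt": f"{lat},{lon}",
        "sort": "geodist() asc",
    })


def structured_explain(query):
    """Request the score explanation as nested data (slide 43)."""
    return select_url({
        "q": query,
        "debug": "results",
        "debug.explain.structured": "true",
        "wt": "ruby",
        "indent": "on",
    })


def json_add_payload(doc):
    """Body for a POST to /update/json, in the shape shown on slide 45."""
    return json.dumps({"add": {"doc": doc}})


if __name__ == "__main__":
    print(facet_excluding_main_query("the query", "category"))
    print(sort_by_distance(39.194564, -86.432947))
    print(structured_explain("title:solr"))
    print(json_add_payload({"id": "MyTestDocument", "title": "This is just a test"}))
```

Building the query string with `urlencode` rather than by hand avoids having to escape the braces, exclamation marks, and spaces in local params and function sorts manually.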