Lucene And Solr Intro

1. Lucene And Solr Introduction By Pascal Dimassimo [email_address]

2. About me: Java developer with 10+ years of experience

3. Working for OpenText/Nstein on the Semantic Navigation application

4. http://semanticnavigation.opentext.com/

5. History: Lucene launches in 2000

6. Solr launches in 2006

7. Lucid Imagination in 2009: hires the core developers of Lucene and Solr

8. Offers commercial support. Lucene Revolution in 2010

9. Buzz: According to IDC, “53% of companies using Open source use Lucene”

10. “Largely responsible for significant decline in commercial OEM revenue” (Source: http://lucenerevolution.com/sites/default/files/slides/Lucene%20Rev%20Preso%20IDC_MarketTrends_Reynolds.pdf)

11. Lucene? “Lucene is a powerful Java search library that lets you easily add search to any application” (Lucene in Action, 2nd Edition)

12. NOT an application

13. Text indexing and searching

14. Open Source

15. Mature

16. Easy to learn API

17. Typical Search App (figure taken from Lucene in Action, 2nd Edition)

18. Search? Naive approach: linear search (à la grep)

19. O(n) -> slow...

20. You want to find a word in a book: how do you do it?

21. Inverted Index

22. Inverted Index (original slide from Michael Busch, available at http://goo.gl/0MQvy)

23. Inverted Index (original slide from Michael Busch, available at http://goo.gl/0MQvy)
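
As a rough illustration of the idea behind an inverted index, here is a minimal sketch in plain Java. It is not Lucene's implementation: the class name `InvertedIndexSketch`, its methods, and the second sample sentence are assumptions made for this example. Each term maps to the ids of the documents that contain it, so a lookup does not need to scan every document.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration class, not part of Lucene.
final class InvertedIndexSketch {
    // term -> sorted ids of the documents containing that term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Lowercases the text, splits it on non-letters, and records each term. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Looks the term up directly instead of scanning all documents. */
    List<Integer> search(String term) {
        SortedSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        if (ids == null) {
            return Collections.emptyList();
        }
        return new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "The old night keeper keeps the keep in the town");
        index.add(2, "In the big old house in the big old gown"); // made-up second document
        System.out.println(index.search("old"));  // [1, 2]
        System.out.println(index.search("keep")); // [1]
    }
}
```

Building the map costs time at indexing, but a query then becomes a single map lookup rather than the O(n) scan of the naive approach.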

24. Lucene Document

    // Open the index stored in ./index
    FSDirectory dir = FSDirectory.open(new File("./index"));
    SimpleAnalyzer analyzer = new SimpleAnalyzer();
    MaxFieldLength len = IndexWriter.MaxFieldLength.UNLIMITED;
    // true = create a new index, overwriting any existing one
    IndexWriter writer = new IndexWriter(dir, analyzer, true, len);

    String content = "The old night keeper keeps the keep in the town";
    Document doc = new Document();
    doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);
    writer.commit();

25. Lucene Document: the document is the unit returned as a search result

26. Organized in fields. A field must be specified at query time!

27. Schema-less

28. Plain text

29. Fields. Indexed: puts the content in the inverted index.

30. Analyzed: splits the content into normalized terms to be added to the inverted index.

31. Stored: keeps the original content on disk

32. Multivalued: the same field is repeated multiple times in the same document with different values

33. Lucene Document

    String content = "The old night keeper keeps the keep in the town";
    String author = "Peter Smith";
    String category1 = "Fiction";
    String category2 = "Canadian";
    String isbn = "978-1-933988-17-7";
    String id = "ABY123";

    Document doc = new Document();
    doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("author", author, Field.Store.YES, Field.Index.ANALYZED));
    // Multivalued: the "category" field is added twice
    doc.add(new Field("category", category1, Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("category", category2, Field.Store.YES, Field.Index.ANALYZED));
    // Indexed as a single term, without analysis
    doc.add(new Field("isbn", isbn, Field.Store.YES, Field.Index.NOT_ANALYZED));
    // Stored only: cannot be searched
    doc.add(new Field("id", id, Field.Store.YES, Field.Index.NO));
    writer.addDocument(doc);
    writer.commit();

34. Lucene Demo: indexing unit tests written by David

35. Relevancy: how do you tell which document is more important?

36. Vector space model: N-dimensional vectors for documents and queries

37. The score represents how close the vectors are

38. Tf-idf (term frequency–inverse document frequency)

39. Documents with many of the search terms are scored higher

40. Shorter documents are scored higher
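
To make the tf-idf idea concrete, here is a simplified sketch in plain Java. It is an assumption-laden illustration, not Lucene's actual scoring formula: the class `TfIdfSketch`, the use of `ln(N / df)` for idf, and the `1 / sqrt(length)` length normalization are choices made for this example only.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; the formula is simplified and is not Lucene's.
final class TfIdfSketch {
    /** Number of times the term occurs in the document. */
    static long termFrequency(String term, List<String> doc) {
        return doc.stream().filter(term::equals).count();
    }

    /** Rarer terms weigh more: ln(N / df), or 0 when no document contains the term. */
    static double inverseDocumentFrequency(String term, List<List<String>> corpus) {
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        return df == 0 ? 0.0 : Math.log((double) corpus.size() / df);
    }

    /** tf * idf, scaled down for longer documents. */
    static double score(String term, List<String> doc, List<List<String>> corpus) {
        if (doc.isEmpty()) {
            return 0.0;
        }
        double lengthNorm = 1.0 / Math.sqrt(doc.size());
        return termFrequency(term, doc) * inverseDocumentFrequency(term, corpus) * lengthNorm;
    }

    public static void main(String[] args) {
        List<String> shortDoc = Arrays.asList("lucene", "search");
        List<String> longDoc = Arrays.asList("lucene", "is", "a", "java", "search", "library");
        List<String> other = Arrays.asList("solr", "server");
        List<List<String>> corpus = Arrays.asList(shortDoc, longDoc, other);
        // Both documents contain "lucene" once; the shorter one gets the higher score.
        System.out.println(score("lucene", shortDoc, corpus) > score("lucene", longDoc, corpus));
    }
}
```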

41. Analyzer (figure taken from Lucene in Action, 2nd Edition)

42. Analyzer: converts text into terms

43. Used when indexing and querying

44. Tokenizer + Filters

45. Custom analyzers

46. Analyzer "The quick brown fox jumped over the lazy dog"

    WhitespaceAnalyzer: [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
    SimpleAnalyzer:     [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
    StopAnalyzer:       [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
    StandardAnalyzer:   [quick] [brown] [fox] [jumped] [over] [lazy] [dog]

    Example from Lucene in Action, 2nd Edition

47. Analyzer "XY&Z Corporation - xyz@example.com"

    WhitespaceAnalyzer: [XY&Z] [Corporation] [-] [xyz@example.com]
    SimpleAnalyzer:     [xy] [z] [corporation] [xyz] [example] [com]
    StopAnalyzer:       [xy] [z] [corporation] [xyz] [example] [com]
    StandardAnalyzer:   [xy&z] [corporation] [xyz@example.com]

    Example from Lucene in Action, 2nd Edition

48. Custom Analyzers

    WhitespaceTokenizer: tokenizes at white spaces
    KeywordTokenizer:    tokenizes the whole input as a single token
    StandardTokenizer:   grammar-based tokenizer that keeps high-level entities (email addresses, etc.) as single tokens
    LowerCaseFilter:     lowercases token text
    StopFilter:          removes words that exist in a provided set of words
    PorterStemFilter:    stems each token using the Porter stemming algorithm. For example, "country" and "countries" both stem to "countri".

    Some descriptions from Lucene in Action, 2nd Edition
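
The "tokenizer + filters" chain can be pictured with a small plain-Java sketch. It does not use Lucene's analysis classes: `AnalyzerSketch` and its tiny stop-word set are assumptions for this example. It mimics a whitespace tokenizer followed by a lowercase filter and a stop filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class, not one of Lucene's analyzers.
final class AnalyzerSketch {
    // A deliberately tiny stop-word set chosen for this example.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "in", "of", "the", "to"));

    /** Whitespace tokenizer, then lowercase filter, then stop filter. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("\\s+")) {              // tokenizer
            String term = token.toLowerCase(Locale.ROOT);      // lowercase filter
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) { // stop filter
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The quick brown fox jumped over the lazy dog"));
        // [quick, brown, fox, jumped, over, lazy, dog]
    }
}
```

Lowercasing before the stop filter is what lets the capitalized "The" match the stop word "the". The same chain has to run at query time so that query terms match the indexed terms.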

49. Query: asking Lucene “what documents contain this word?”

50. Lucene applies an analyzer to each queried word

51. Queries can be built programmatically

52. Powerful Query Syntax

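53. Query code

    SimpleAnalyzer analyzer = new SimpleAnalyzer();
    // "content" is the default field when the query names none
    QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
    Query query = parser.parse("big");
    // searcher is an IndexSearcher opened on the index; return the top 10 hits
    TopDocs docs = searcher.search(query, 10);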

54. Query Syntax: Basic. title:montreal (field: title; text: montreal)

55. Query Syntax: Range. name:[a TO k] (field: name; range: [a TO k])

56. Query Syntax: Boolean. title:(java AND programming) (field: title; operator: AND)

57. Query Syntax: Boolean. title:java OR name:pascal (fields: title, name; operator: OR)

58. Query Syntax: Phrase. title:"Lucene in Action" (field: title; phrase: "Lucene in Action")

59. Query Syntax: Wildcard. title:program* (field: title; term prefix: program)
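
One way to picture the boolean operators is as set operations on the document ids an inverted index returns for each term: AND as an intersection, OR as a union. The sketch below is conceptual only and is not how Lucene evaluates queries internally; `BooleanQuerySketch` and the sample ids are assumptions for this example.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration class, not Lucene's BooleanQuery.
final class BooleanQuerySketch {
    /** Documents that appear in both posting lists. */
    static SortedSet<Integer> and(SortedSet<Integer> a, SortedSet<Integer> b) {
        SortedSet<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    /** Documents that appear in at least one posting list. */
    static SortedSet<Integer> or(SortedSet<Integer> a, SortedSet<Integer> b) {
        SortedSet<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    public static void main(String[] args) {
        // Made-up posting lists for the terms "java" and "programming" in the title field.
        SortedSet<Integer> java = new TreeSet<>(Arrays.asList(1, 2, 3));
        SortedSet<Integer> programming = new TreeSet<>(Arrays.asList(2, 3, 4));
        System.out.println(and(java, programming)); // [2, 3]
        System.out.println(or(java, programming));  // [1, 2, 3, 4]
    }
}
```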

60. Lucene Demo: searching unit tests written by David

61. Lucene summary: inverted index for fast document retrieval

62. Relevancy scoring

63. Powerful query syntax

64. Solr: created by Yonik Seeley in 2004 and released as open source in 2006

65. HTTP application built around Lucene

66. Makes it easy to develop search solutions

67. Advanced features developed on top of Lucene

68. As of 2010, the Lucene and Solr projects are merged

69. Solr Schema: Solr lets you administer one or more Lucene indexes

70. Each index has its own schema

71. Lists all fields allowed for an index

72. Defines the analyzers for each field

73. Solr Schema

    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="title" type="text" indexed="true" stored="true"/>
    <field name="presenter" type="text_ws" indexed="true" stored="true"/>
    <field name="date" type="date" indexed="true" stored="true"/>
    <field name="abstract" type="text" indexed="true" stored="true"/>

74. Solr Schema

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
      </analyzer>
    </fieldType>

75. Solr Schema

    <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

76. Solr Indexation: HTTP POST

77. XML by default, but also CSV

78. Multithreaded

79. Advanced features: binary document extraction, DB plugin

80. Solr Indexation

    add.xml:

    <add>
      <doc>
        <field name="id">002</field>
        <field name="title">Lucene And Solr Introduction</field>
        <field name="presenter">Pascal Dimassimo</field>
        <field name="date">2010-11-18T00:00:00Z</field>
        <field name="abstract">...</field>
      </doc>
      <doc>...</doc>
    </add>

    curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary @add.xml

81. Solr Query: HTTP GET

82. Query Parameters

83. Response in XML by default, but other formats are supported (json, php, ruby)

84. Solr Query

    curl "http://localhost:8983/solr/select?q=title:Lucene"

    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">269</int>
        <lst name="params">
          <str name="q">title:Lucene</str>
        </lst>
      </lst>
      <result name="response" numFound="1" start="0">
        <doc>
          <str name="id">002</str>
          <str name="title">Lucene And Solr Introduction</str>
          <str name="presenter">Pascal Dimassimo</str>
          <date name="date">2010-11-18T00:00:00Z</date>
          <str name="abstract">...</str>
        </doc>
      </result>
    </response>

85. Solr Query Parameters

    q      Lucene query
    sort   Field to sort on. Defaults to score
    start  Offset of the first result to display. Default: 0
    rows   Number of results to display per page. Default: 10
    fq     Filter query. Defaults to all documents
    fl     List of fields to display per document. Defaults to all fields
    wt     Response format. Defaults to xml
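
As a hedged sketch of how these parameters end up in a request, the plain-Java helper below assembles a `/select` URL with URL-encoded values. The class `SolrQueryUrlSketch` and the method `buildSelectUrl` are hypothetical names for this example, and no request is actually sent.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; it is not part of Solr or SolrJ.
final class SolrQueryUrlSketch {
    /** Appends /select and the URL-encoded parameters to the base URL, in insertion order. */
    static String buildSelectUrl(String baseUrl, Map<String, String> params) {
        StringBuilder url = new StringBuilder(baseUrl).append("/select");
        char separator = '?';
        for (Map.Entry<String, String> param : params.entrySet()) {
            url.append(separator)
               .append(encode(param.getKey()))
               .append('=')
               .append(encode(param.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "title:Lucene");
        params.put("sort", "date desc");
        params.put("rows", "5");
        params.put("wt", "json");
        System.out.println(buildSelectUrl("http://localhost:8983/solr", params));
        // http://localhost:8983/solr/select?q=title%3ALucene&sort=date+desc&rows=5&wt=json
    }
}
```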

86. Solr Facets: for a query's results, lists all distinct indexed values of a field with their frequencies

87. Useful for drilling down into a result set
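
The facet idea can be sketched in plain Java as counting, for one field, how many result documents carry each distinct value. This is an illustration only, not Solr's faceting code: `FacetSketch`, the map-based document representation, and the sample data are assumptions for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; documents are modeled as field -> values maps.
final class FacetSketch {
    /** For each distinct value of the field, counts the documents that have it. */
    static Map<String, Integer> facetCounts(List<Map<String, List<String>>> results, String field) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, List<String>> doc : results) {
            List<String> values = doc.getOrDefault(field, Collections.<String>emptyList());
            // A value repeated inside one document still counts that document once.
            for (String value : new HashSet<>(values)) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map<String, List<String>>> results = new ArrayList<>();
        results.add(Collections.singletonMap("category", Arrays.asList("Fiction", "Canadian")));
        results.add(Collections.singletonMap("category", Arrays.asList("Fiction")));
        System.out.println(facetCounts(results, "category")); // {Canadian=1, Fiction=2}
    }
}
```

Drilling down then amounts to re-running the query with one of the listed values as a filter, for example through the `fq` parameter.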

88. SolrJ: library to connect to and interact with Solr

    String url = "http://localhost:8983/solr";
    CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "id1", 1.0f);
    doc.addField("name", "doc1", 1.0f);
    doc.addField("price", 10);
    server.add(doc);
    server.commit();

89. Solr Demo: using Evernote data

90. Indexed using solr-feeder

91. https://github.com/pascaldimassimo/solr-feeder

92. Solr Features: Text Highlighting

93. Spell Checking

94. Forced Placements

95. More Like This

96. Replication

97. Database connector

98. Geo-location

99. More Information
