Introduction to Apache Solr
Andrew Jackson
UK Web Archive Technical Lead
www.bl.uk 2
Web Archive Overall Architecture
Understanding Your Use Case(s)
• Full text search, right?
– Yes, but there are many variations and choices to make.
• Work with users to understand their information needs:
– Are they looking for…
• Particular (archived) web resources?
• Resources on a particular issue or subject?
• Evidence of trends over time?
– What aspects of the content do they consider important?
– What kind of outputs do they want?
Working With Historians…
• JISC AADDA Project:
– Initial index and UI of the 1996-2010 data
– Great learning experience and feedback
– http://domaindarkarchive.blogspot.co.uk/
• AHRC ‘Big Data’ Project:
– Second iteration of index and UI
– Bursary holders’ reports coming soon
– http://buddah.projects.history.ac.uk/
• Interested in trends and reflections of society
– Who links to who/what, over time?
Apache Solr & Lucene
• Apache Lucene:
– A Java library for full text indexes
• Apache Solr:
– A web service and API that exposes Lucene functionality as a
document database
– Supports SolrCloud mode for distributed searches
• See also:
– Elasticsearch (also built around Lucene)
– We ‘chose’ Solr before Elasticsearch existed
– http://solr-vs-elasticsearch.com/
Example: Indexing Quotes
• Quotes to be indexed:
– “To do is to be.” - Jean-Paul Sartre
– “To be is to do.” - Socrates
– “Do be do be do.” - Frank Sinatra
• Goals:
– Index the quotation for full-text search.
• e.g. Show me all quotes that contain “to be”.
– Index the author for faceted search.
• e.g. Show me all quotes by “Frank Sinatra”.
Lucene’s Inverted Indexes
Solr as a Document Database
• Solr Indexes/Stores & Retrieves:
– Documents
composed of:
• Multiple Fields
each of which has a defined:
– Field Type
such as ‘text’, ‘string’, ‘int’, etc.
• The queries you can support depend on many
parameters, but the fields and their types are the most
critical factors.
– See Overview of Documents, Fields, and Schema Design
The Quotes As Solr Documents
• Our Documents contain three fields:
– ‘id’ field of type ‘string’
– ‘text’ field of type ‘text_general’
– ‘author’ field, of type ‘string’
• Example Documents:
– id: “1”, text: “To do is to be.”, author: “Jean-Paul Sartre”
– id: “2”, text: “To be is to do.”, author: “Socrates”
– id: “3”, text: “Do be do be do.”, author: “Frank Sinatra”
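The three example documents can be written as the JSON array that Solr's update handler accepts. This is a minimal sketch: the host, port and core name `quotes` are placeholders chosen for illustration, and nothing is actually sent to a server.

```python
import json

# The three example documents, one dict per Solr document.
docs = [
    {"id": "1", "text": "To do is to be.", "author": "Jean-Paul Sartre"},
    {"id": "2", "text": "To be is to do.", "author": "Socrates"},
    {"id": "3", "text": "Do be do be do.", "author": "Frank Sinatra"},
]

# Hypothetical endpoint: host, port and core name are placeholders.
update_url = "http://localhost:8983/solr/quotes/update?commit=true"

# Body for an HTTP POST with Content-Type: application/json.
payload = json.dumps(docs, ensure_ascii=False, indent=2)
print(payload)
```

The field types ('string', 'text_general') are not part of the payload; they come from the schema on the Solr side.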
Solr Update Flow
Analyzing The Text Field
• Analyzing the text on document 1:
– Input: “To do is to be.”, type = ‘text_general’
– Standard Tokeniser:
• ‘To’ ‘do’ ‘is’ ‘to’ ‘be’
– Lower Case Filter:
• ‘to’ ‘do’ ‘is’ ‘to’ ‘be’
• Adding the tokens to the index:
– ‘be’ => id:1
– ‘do’ => id:1
– …
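The analysis and indexing steps can be sketched in a few lines of Python. This is an illustration only, not Lucene's implementation: the regular expression is a rough stand-in for the Standard Tokeniser, and the dictionary of sets stands in for the inverted index.

```python
import re
from collections import defaultdict

def analyse(text):
    """Split on non-word characters, then lower-case each token."""
    tokens = re.findall(r"\w+", text)      # stand-in for the Standard Tokeniser
    return [t.lower() for t in tokens]     # stand-in for the Lower Case Filter

docs = {
    "1": "To do is to be.",
    "2": "To be is to do.",
    "3": "Do be do be do.",
}

# Inverted index: term -> set of ids of the documents that contain it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in analyse(text):
        index[term].add(doc_id)

for term in sorted(index):
    print(term, "=>", sorted(index[term]))
```

Looking up a term is then a single dictionary access, which is why the index is keyed by term rather than by document.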
Analyzing The Author Field
• Analyzing the author on document 1:
– Input: “Jean-Paul Sartre”, type = ‘string’
– Strings are stored as is.
• Adding the tokens to the index:
– ‘Jean-Paul Sartre’ => id:1
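A sketch of why an untokenised 'string' field suits faceting: each complete value is one term, so exact-match lookups and per-value counts are straightforward. The data structures below are illustrative and are not how Solr stores the field internally.

```python
from collections import Counter

authors = {
    "1": "Jean-Paul Sartre",
    "2": "Socrates",
    "3": "Frank Sinatra",
}

# A 'string' field is indexed as one untokenised term per value.
author_index = {}
for doc_id, author in authors.items():
    author_index.setdefault(author, set()).add(doc_id)

# Exact-match lookup: only the complete, case-sensitive value matches.
print(author_index.get("Frank Sinatra", set()))
print(author_index.get("sinatra", set()))   # empty: this is a different term

# Facet counts are the number of documents per distinct value.
facets = Counter(authors.values())
print(facets.most_common())
```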
Solr Query Flow
Query for text:“To be”
• Uses the same analyser
as the indexer:
– “To be?”
– ST: “To” “be”
– LCF: “to” “be”
• Returns
documents:
– 1
– 2
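The phrase query can be sketched with a positional index, which records where each term occurs as well as which documents contain it. This is a simplified illustration of the idea, assuming the same regex-based analysis for both indexing and querying; it is not Lucene's actual phrase-matching code.

```python
import re
from collections import defaultdict

def analyse(text):
    """Same analysis chain at index time and at query time."""
    return [t.lower() for t in re.findall(r"\w+", text)]

docs = {"1": "To do is to be.", "2": "To be is to do.", "3": "Do be do be do."}

# Positional index: term -> {doc_id: [positions of the term in that document]}.
positions = defaultdict(lambda: defaultdict(list))
for doc_id, text in docs.items():
    for pos, term in enumerate(analyse(text)):
        positions[term][doc_id].append(pos)

def phrase_search(query):
    """Return ids of documents that contain the query terms consecutively."""
    terms = analyse(query)
    if not terms:
        return []
    hits = []
    for doc_id, starts in positions[terms[0]].items():
        for start in starts:
            if all(start + i in positions[t].get(doc_id, [])
                   for i, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)

print(phrase_search("To be?"))   # ['1', '2']
```

Document 3 is excluded because it never contains the term 'to', even though it contains 'be'.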
Solr’s Built-in UI
Solr Overall Flow
Choice: Ignore ‘stop words’?
• Removes common words, unrelated to subject/topic
– Input: “To do is to be”
– Standard Tokeniser:
• ‘To’ ‘do’ ‘is’ ‘to’ ‘be’
– Stop Words Filter (stopwords_en.txt):
• ‘do’
– Lower Case Filter:
• ‘do’
• Cannot support phrase search
– e.g. searching for “to be”
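A small sketch of the trade-off. The stop list here is a three-word illustration rather than the contents of stopwords_en.txt, and for simplicity the tokens are lower-cased before filtering, which assumes a case-insensitive stop filter.

```python
import re

# Illustrative stop list only; a real list is much longer.
STOP_WORDS = {"to", "be", "is"}

def analyse(text, remove_stop_words):
    """Tokenise, lower-case, and optionally drop stop words."""
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

print(analyse("To do is to be", remove_stop_words=True))   # ['do']
print(analyse("to be", remove_stop_words=True))            # []
```

With stop words removed, the query "to be" analyses to no terms at all, so there is nothing left to match as a phrase.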
Choice: Stemming?
• Attempts to group concepts together:
– “fishing”, “fished”, “fisher” => “fish”
– “argue”, “argued”, “argues”, “arguing”, “argus” => “argu”
• Sometimes confused:
– “axes” => “axe”, or “axis”?
• Better at grouping related items together
• Makes precise phrase searching difficult
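To make the grouping effect concrete, here is a deliberately naive suffix-stripping function. It is a toy written for this example, not the Porter algorithm or any stemmer shipped with Solr, and its three-suffix rule set is an assumption.

```python
SUFFIXES = ("ing", "ed", "er")  # toy rule set, far smaller than a real stemmer

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["fishing", "fished", "fisher", "fish"]:
    print(w, "=>", toy_stem(w))
```

All four words collapse to the same index term, so a search for any of them finds the others, but the index can no longer tell them apart.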
So Many Choices…
• Lots of text indexing options to tune:
– Punctuation and tokenization:
• is www.google.com one or three tokens?
– Stop word filter (“the” => “”)
– Lower case filter (“This” => “this”)
– Stemming (choice of algorithms too)
– Keywords (excepted from stemming)
– Synonyms (“TV” => “Television”)
– Possessive Filter (“Blair’s” => “Blair”)
– …and many more Tokenizers and Filters.
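The tokenisation question about www.google.com can be tried directly. The two splitting rules below are simple stand-ins for different tokeniser choices, not Solr's own tokenisers.

```python
import re

text = "Search www.google.com today"

# Choice 1: split on whitespace only, so the host name stays one token.
whitespace_tokens = text.split()

# Choice 2: split on every non-word character, so the dots break it up.
word_tokens = re.findall(r"\w+", text)

print(whitespace_tokens)
print(word_tokens)
```

Under the first rule a search for "google" finds nothing; under the second, a search for the exact host name has to be handled as a three-term phrase.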
Even More Choices: Query Features
• As well as full-text search variations, we have
– Query parsers and features:
• Proximity, wildcards, term frequencies, relevance…
– Faceted search
– Numeric or Date values and range queries
– Geographic data and spatial search
– Snippets/fragments and highlighting
– Spell checking i.e. ‘Did you mean …?’
– MoreLikeThis
– Clustering
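Several of these features are switched on through request parameters. The sketch below builds a request URL that combines a phrase query with faceting and highlighting on the quotes example; the host, port and core name are placeholders, and no request is sent.

```python
from urllib.parse import urlencode

params = [
    ("q", 'text:"to be"'),        # phrase query on the text field
    ("facet", "true"),            # switch faceting on
    ("facet.field", "author"),    # count matching documents per author
    ("hl", "true"),               # switch highlighting on
    ("hl.fl", "text"),            # take highlight snippets from the text field
    ("rows", "10"),               # number of results to return
    ("wt", "json"),               # response format
]

# Hypothetical endpoint: host, port and core name are placeholders.
base_url = "http://localhost:8983/solr/quotes/select"
query_url = base_url + "?" + urlencode(params)
print(query_url)
```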
How to get started?
• Experimenting with the UKWA stack:
– Indexing:
• webarchive-discovery
– User Interfaces:
• Drupal Sarnia
• Shine (Play Framework, by UKWA)
• See https://github.com/ukwa/webarchive-discovery/wiki/Front-ends
The webarchive-discovery system
• The webarchive-discovery codebase is an indexing stack
that reflects our (UKWA) use cases
– Contains our choices, reflects our progress so far
– Turns ARC or WARC records into Solr Documents
– Highly robust against (W)ARC data quality problems
• Adds custom fields for web archiving
– Text extracted using Apache Tika
– Various other analysis features
• Workshop sessions will use our setup
– but this is only a starting point…
Features: Basic Metadata Fields
• From the file system:
– The source (W)ARC filename and offset
• From the WARC record:
– URL, host, domain, public suffix
– Crawl date(s)
• From the HTTP headers:
– Content length
– Content type (as served)
– Server software IDs
Features: Payload Analysis
• Binary hash, embedded metadata
• Format and preservation risk analysis:
– Apache Tika & DROID format and encoding ID
– Notes parse errors to spot access problems
– Apache Preflight PDF risk analysis
– XML root namespace
– Format signature generation tricks
• HTML links, elements used, licence/rights URL
• Image properties, dominant colours, face detection
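The 'binary hash' idea can be illustrated with a standard digest over the raw payload bytes. The choice of SHA-1 here is an assumption made for the example; the slides do not say which algorithm webarchive-discovery uses.

```python
import hashlib

def payload_hash(payload: bytes) -> str:
    """Return the SHA-1 hex digest of the raw payload bytes."""
    return hashlib.sha1(payload).hexdigest()

a = payload_hash(b"<html><body>Hello</body></html>")
b = payload_hash(b"<html><body>Hello</body></html>")
c = payload_hash(b"<html><body>Hello!</body></html>")

# Identical payloads share a hash; a one-byte change gives a different one.
print(a == b, a == c)
```

Indexing such a hash makes it possible to find byte-identical copies of a resource across URLs and crawl dates.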
Features: Text Analysis
• Text extraction from binary formats
• ‘Fuzzy’ hash (ssdeep) of text
– for similarity analysis
• Natural language detection
• UK postcode extraction and geo-indexing
• Experimental language analysis:
– Simplistic sentiment analysis
– Stanford NLP named entity extraction
– Initial GATE NLP analyser
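Postcode extraction can be approximated with a regular expression. The pattern below is a simplified assumption that covers common postcode shapes; it is not the rule used by webarchive-discovery, and the geo-indexing step (looking up coordinates for each postcode) is not shown.

```python
import re

# Simplified pattern: outward code (e.g. NW1), optional space, inward code (e.g. 2DB).
POSTCODE = re.compile(r"\b[A-Z]{1,2}[0-9][0-9A-Z]?\s?[0-9][A-Z]{2}\b")

def extract_postcodes(text):
    """Return postcode-like strings found in the upper-cased text."""
    return POSTCODE.findall(text.upper())

print(extract_postcodes("The British Library, 96 Euston Road, London NW1 2DB"))
```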
Command-line Indexing Architecture
Hadoop Indexing Architecture
Scaling Solr
• We are operating outside Solr’s sweet spot:
– General recommendation is RAM = Index Size
– We have a 15TB index. That’s a lot of RAM.
• e.g. from this email
– “100 million documents [and 16-32GB] per node”
– “it's quite the fool's errand for average developers to try to
replicate the "heroic efforts" of the few.”
• So how to scale up?
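A back-of-envelope calculation shows why this is outside the sweet spot. It assumes 1 TB = 1000 GB and takes the "RAM = Index Size" rule literally, so the results are rough orders of magnitude rather than a sizing recommendation.

```python
import math

index_size_gb = 15 * 1000          # 15 TB index, using 1 TB = 1000 GB
ram_per_node_gb = (16, 32)         # range quoted for a typical node

# If RAM should roughly equal index size, nodes = index size / RAM per node.
for ram in ram_per_node_gb:
    nodes = math.ceil(index_size_gb / ram)
    print(f"{ram} GB per node -> about {nodes} nodes")
```

That is roughly 470 to 940 nodes, which is why the rest of the talk looks at disks, caching and sharding instead.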
Basic Index Performance Scaling
• One Query:
– Single-threaded binary search
– Seek-and-read speed is critical, not CPU
• Add RAID/SAN?
– More IOPS can support more concurrent queries
– BUT each query is no faster
• Want faster queries?
– Use SSD, and/or
– More RAM to cache more disk, and/or
– Split the data into more shards (on independent media)
Sharding & SolrCloud
• For > ~100 million documents, use shards
– More, smaller independent shards == faster search
• Shard generation:
– SolrCloud ‘Live’ shards
• We use Solr’s standard sharding
• Randomly distributes records
• Supports updates to records
– Manual sharding
• e.g. ‘static’ shards generated from files
• As used by the Danish web archive (see later today)
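The effect of hash-based routing, where records end up spread evenly but with no meaningful grouping, can be sketched as follows. Using MD5 modulo the shard count is an assumption made for this example; it is not the hash function or router that SolrCloud itself uses.

```python
import hashlib

def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard number via a stable hash."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

num_shards = 4
counts = [0] * num_shards
for i in range(10000):
    counts[shard_for(f"doc-{i}", num_shards)] += 1

# Each shard receives roughly a quarter of the documents.
print(counts)
```

Because the shard depends only on the id, an update to an existing record is routed to the same shard as the original, which is what makes updates possible.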
Next Steps
• Prototype, Prototype, Prototype
– Expect to re-index
– Expect to iterate your front and back end systems
– Seek real user feedback
• Benchmark, Benchmark, Benchmark
– More on scaling issues and benchmarking this afternoon
• Work Together
– Share use cases, indexing tactics
– Share system specs, benchmarks
– Share code where appropriate

IIPC-Training-Event-Jan-2014-Solr-Introduction.pdf
