OCTOBER 13-16, 2015 • AUSTIN, TX
Faceting optimizations for Solr
Toke Eskildsen
Search Engineer / Solr Hacker
State and University Library, Denmark
@TokeEskildsen / te@statsbiblioteket.dk
Overview
• Web scale at the State and University Library, Denmark
• Field faceting 101
• Optimizations
  - Reuse
  - Tracking
  - Caching
  - Alternative counters
Web scale for a small web
• Denmark
  - Consolidation circa 10th century
  - 5.6 million people
• Danish Net Archive (http://netarkivet.dk)
  - Constitution 2005
  - 20 billion items / 590TB+ raw data
Indexing 20 billion web items / 590TB into Solr
• Solr index size is 1/9th of the raw data = 70TB
• Each shard holds 200M documents / 900GB
  - Shards are built chronologically by a dedicated machine
  - Projected 80 shards
  - Current build time per shard: 4 days
  - Total build time is 20 CPU-core years
  - So far only 7.4 billion documents / 27TB in the index
Searching a 7.4 billion documents / 27TB Solr index
• SolrCloud with 2 machines, each having
  - 16 HT-cores, 256GB RAM, 25 * 930GB SSD
  - 25 shards @ 900GB
  - 1 Solr instance per shard per SSD, Xmx=8g, Solr 4.10
  - Disk cache 100GB, or < 1% of index size
String faceting 101 (single shard)
counter = new int[ordinals]
for docID: result.getDocIDs()                 // iterate the result set
    for ordinal: getOrdinals(docID)           // term ordinals for the document
        counter[ordinal]++
for ordinal = 0 ; ordinal < counter.length ; ordinal++
    priorityQueue.add(ordinal, counter[ordinal])   // keeps only the top-X entries
for entry: priorityQueue
    result.add(resolveTerm(entry.ordinal), entry.count)
ord  term  counter
  0  A        0
  1  B        3
  2  C        0
  3  D     1006
  4  E        1
  5  F        1
  6  G        0
  7  H        0
  8  I        3
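
To make the counting phase concrete, here is a minimal self-contained Java sketch of the algorithm above. OrdinalSource and its methods are hypothetical stand-ins for the DocValues lookups, not Solr API:

import java.util.PriorityQueue;

// Single-shard String faceting, as in the pseudocode: count per ordinal,
// keep the top-X in a size-bounded min-heap, then resolve ordinals to terms.
public class FacetCount101 {
    // Hypothetical stand-in for Lucene DocValues access.
    interface OrdinalSource {
        int[] ordinals(int docID);   // term ordinals for a document
        String term(int ordinal);    // resolve an ordinal back to its term
    }

    static String[] topX(int[] docIDs, OrdinalSource src, int ordinalCount, int topX) {
        int[] counter = new int[ordinalCount];            // one slot per unique term
        for (int docID : docIDs) {
            for (int ordinal : src.ordinals(docID)) {
                counter[ordinal]++;
            }
        }
        // Min-heap of {ordinal, count}: the smallest of the current top-X is evicted first.
        PriorityQueue<int[]> pq = new PriorityQueue<>((a, b) -> Integer.compare(a[1], b[1]));
        for (int ord = 0; ord < counter.length; ord++) {
            if (pq.size() < topX) {
                pq.add(new int[]{ord, counter[ord]});
            } else if (counter[ord] > pq.peek()[1]) {
                pq.poll();
                pq.add(new int[]{ord, counter[ord]});
            }
        }
        String[] result = new String[pq.size()];
        for (int i = result.length - 1; i >= 0; i--) {    // drain heap: smallest first
            int[] entry = pq.poll();
            result[i] = src.term(entry[0]) + " (" + entry[1] + ")";
        }
        return result;
    }
}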
Test setup 1 (easy start)
• Solr setup
  - 16 HT-cores, 256GB RAM, SSD
  - Single shard, 250M documents / 900GB
• URL field
  - Single String value
  - 200M unique terms
• 3 concurrent “users”
• Random search terms
Vanilla Solr, single shard, 250M documents, 200M values, 3 users
Allocating and dereferencing 800MB arrays
Reuse the counter
counter = new int[ordinals]
for docID: result.getDocIDs()
    for ordinal: getOrdinals(docID)
        counter[ordinal]++
for ordinal = 0 ; ordinal < counter.length ; ordinal++
    priorityQueue.add(ordinal, counter[ordinal])
<counter is no longer referenced and will be garbage collected at some point>
Reuse the counter
counter = pool.getCounter()
for docID: result.getDocIDs()
    for ordinal: getOrdinals(docID)
        counter[ordinal]++
for ordinal = 0 ; ordinal < counter.length ; ordinal++
    priorityQueue.add(ordinal, counter[ordinal])
pool.release(counter)
Note: The JSON Facet API in Solr 5 already supports reuse of counters.
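
A minimal sketch of such a pool, assuming fixed-size int[] counters; CounterPool, getCounter and release mirror the pseudocode and are not Solr's actual classes:

import java.util.Arrays;
import java.util.concurrent.ConcurrentLinkedQueue;

// Recycles counter arrays instead of allocating (and garbage collecting)
// a fresh 800MB int[] for every facet call.
public class CounterPool {
    private final int size;
    private final ConcurrentLinkedQueue<int[]> free = new ConcurrentLinkedQueue<>();

    public CounterPool(int size) { this.size = size; }

    public int[] getCounter() {
        int[] counter = free.poll();
        return counter != null ? counter : new int[size];
    }

    public void release(int[] counter) {
        Arrays.fill(counter, 0);   // clearing costs too, but far less than allocate + GC
        free.add(counter);
    }
}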
Using and clearing 800MB arrays
Reusing counters vs. not doing so
Reusing counters, now with readable visualization
Why does it always take more than 500ms?
Iteration is not free
counter = pool.getCounter()
for docID: result.getDocIDs()
    for ordinal: getOrdinals(docID)
        counter[ordinal]++
for ordinal = 0 ; ordinal < counter.length ; ordinal++   // visits every counter, hit or not
    priorityQueue.add(ordinal, counter[ordinal])
pool.release(counter)
200M unique terms = 800MB to iterate (and to clear on release), no matter how small the result set.
Tracking updated counters

A second array, the tracker, records which ordinals have been updated, so only those need to be visited afterwards. Starting from a cleared counter for ordinals 0-8:

counter: [0, 0, 0, 0, 0, 0, 0, 0, 0]      tracker: []

counter[3]++
counter: [0, 0, 0, 1, 0, 0, 0, 0, 0]      tracker: [3]

counter[1]++
counter: [0, 1, 0, 1, 0, 0, 0, 0, 0]      tracker: [3, 1]

counter[1]++ counter[1]++
counter: [0, 3, 0, 1, 0, 0, 0, 0, 0]      tracker: [3, 1]    (ordinal 1 is already tracked)

counter[8]++ counter[8]++ counter[4]++ counter[8]++ counter[5]++ … (many more increments)
counter: [0, 3, 0, 1006, 1, 1, 0, 0, 3]   tracker: [3, 1, 8, 4, 5]
Tracking updated counters
counter = pool.getCounter()
for docID: result.getDocIDs()
    for ordinal: getOrdinals(docID)
        if counter[ordinal]++ == 0 && tracked < maxTracked
            tracker[tracked++] = ordinal
if tracked < maxTracked                      // tracker did not overflow: visit only touched ordinals
    for i = 0 ; i < tracked ; i++
        priorityQueue.add(tracker[i], counter[tracker[i]])
else                                         // tracker overflowed: fall back to full iteration
    for ordinal = 0 ; ordinal < counter.length ; ordinal++
        priorityQueue.add(ordinal, counter[ordinal])
The final state from the walkthrough above:

counter: [0, 3, 0, 1006, 1, 1, 0, 0, 3]    tracker: [3, 1, 8, 4, 5]
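
A compact Java sketch of the tracked counter described by the pseudocode above; maxTracked bounds the tracker, and the caller falls back to full iteration when overflowed() reports true:

// Tracked counter: remembers up to maxTracked touched ordinals so the
// top-X selection can skip the untouched (zero) entries.
public class TrackedCounter {
    final int[] counter;   // one count per unique term ordinal
    final int[] tracker;   // ordinals touched so far, in first-hit order
    int tracked = 0;

    TrackedCounter(int ordinals, int maxTracked) {
        counter = new int[ordinals];
        tracker = new int[maxTracked];
    }

    void increment(int ordinal) {
        // First increment for this ordinal: remember it, unless the tracker is full.
        if (counter[ordinal]++ == 0 && tracked < tracker.length) {
            tracker[tracked++] = ordinal;
        }
    }

    // When true, too many distinct ordinals were hit: iterate the full counter instead.
    boolean overflowed() { return tracked >= tracker.length; }
}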
Distributed faceting
Phase 1) All shards perform faceting. The merger calculates the top-X terms.
Phase 2) The term counts are requested from the shards that did not return them in phase 1. The merger calculates the final counts for the top-X terms.

for term: fineCountRequest.getTerms()
    result.add(term,
        searcher.numDocs(query(field:term), base.getDocIDs()))
Test setup 2 (more shards, smaller field)
• Solr setup
  - 16 HT-cores, 256GB RAM, SSD
  - 9 shards @ 250M documents / 900GB
• domain field
  - Single String value
  - 1.1M unique terms per shard
• 1 concurrent “user”
• Random search terms
Pit of Pain™ (or maybe “Horrible Hill”?)
Fine counting can be slow
Phase 1: standard faceting
Phase 2:
for term: fineCountRequest.getTerms()
    result.add(term,
        searcher.numDocs(query(field:term), base.getDocIDs()))   // one query + doc set intersection per requested term
Alternative fine counting
counter = pool.getCounter()
for docID: result.getDocIDs()
    for ordinal: getOrdinals(docID)
        counter.increment(ordinal)
for term: fineCountRequest.getTerms()
    result.add(term, counter.get(getOrdinal(term)))

The counting loop is the same as in phase 1, which yields:

counter: [0, 3, 0, 1006, 1, 1, 0, 0, 3]
Using cached counters from phase 1 in phase 2
counter = pool.getCounter(key)
for term: query.getTerms()
    result.add(term, counter.get(getOrdinal(term)))
pool.release(counter)
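
A sketch of the keyed pool, under the assumption that the key identifies the query and that a simple LRU policy evicts old counters; this mirrors the pseudocode, not Solr's implementation:

import java.util.LinkedHashMap;
import java.util.Map;

// Keeps filled counters keyed by query, so phase 2 can reuse the
// counts from phase 1 instead of issuing per-term queries.
public class CachingCounterPool {
    private final int size;
    private final int maxCached;
    private final LinkedHashMap<String, int[]> cache =
            new LinkedHashMap<String, int[]>(16, 0.75f, true) {
                @Override protected boolean removeEldestEntry(Map.Entry<String, int[]> eldest) {
                    return size() > maxCached;   // simple LRU eviction
                }
            };

    public CachingCounterPool(int size, int maxCached) {
        this.size = size;
        this.maxCached = maxCached;
    }

    // Phase 1: unknown key, hand out a fresh counter and remember it.
    // Phase 2: same key, hand back the already filled counter.
    public synchronized int[] getCounter(String key) {
        int[] counter = cache.get(key);
        if (counter == null) {
            counter = new int[size];
            cache.put(key, counter);
        }
        return counter;
    }

    public synchronized void release(int[] counter) {
        // Intentionally does not clear: the contents stay cached for a
        // possible phase 2, until the LRU policy evicts the entry.
    }
}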
Pit of Pain™ practically eliminated
Stick figure CC BY-NC 2.5 Randall Munroe, xkcd.com
Test setup 3 (more shards, more fields)
• Solr setup
  - 16 HT-cores, 256GB RAM, SSD
  - 23 shards @ 250M documents / 900GB
• Faceting on 6 fields
  - url: ~200M unique terms / shard
  - domain & host: ~1M unique terms each / shard
  - type, suffix, year: < 1000 unique terms / shard
1 machine, 7 billion documents / 23TB total index, 6 facet fields
High-cardinality can mean different things
Single shard / 250,000,000 docs / 900GB

Field    References      Max docs/term   Unique terms
domain     250,000,000       3,000,000       1,100,000
url        250,000,000          56,000     200,000,000
links    5,800,000,000       5,000,000     610,000,000

An int[] counter for links alone: 610M ordinals × 4 bytes = 2440 MB per counter.
Different distributions

(Graph: doc-frequency distributions for domain (1.1M terms), url (200M) and links (600M) — domain has a high max count and a short tail, url a low max and a very long tail.)
Theoretical lower limit per counter: log2(max_count) bits
A counter that never exceeds max_count needs only ceil(log2(max_count + 1)) bits: max=1 fits in 1 bit, max=3 in 2 bits, max=7 in 3, max=63 in 6, max=2047 in 11.
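
As a worked example with the numbers from the cardinality table above (a sketch, not part of the talk's code):

public class CounterBits {
    // Minimum bits per counter: ceil(log2(maxCount + 1)).
    static int bitsRequired(long maxCount) {
        return 64 - Long.numberOfLeadingZeros(maxCount);
    }

    public static void main(String[] args) {
        System.out.println(bitsRequired(3_000_000));   // 22 bits for domain (vs. 32 in an int[])
        System.out.println(bitsRequired(56_000));      // 16 bits for url
        System.out.println(bitsRequired(5_000_000));   // 23 bits for links
    }
}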
int vs. PackedInts

Field     int[ordinals]   PackedInts(ordinals, maxBPV)
domain        4 MB             3 MB (72%)
url         780 MB           420 MB (53%)
links      2350 MB          1760 MB (75%)
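
Lucene ships such a packed structure; a minimal sketch of allocating and incrementing one with the Lucene 4.x PackedInts API (ordinals and maxDocsPerTerm taken from the table above):

import org.apache.lucene.util.packed.PackedInts;

public class PackedCounter {
    // One entry per unique term, each just wide enough for the field's max count.
    static PackedInts.Mutable newCounter(int ordinals, long maxDocsPerTerm) {
        return PackedInts.getMutable(
                ordinals,
                PackedInts.bitsRequired(maxDocsPerTerm),   // e.g. 23 bits for links
                PackedInts.COMPACT);                       // favor memory over speed
    }

    static void increment(PackedInts.Mutable counter, int ordinal) {
        counter.set(ordinal, counter.get(ordinal) + 1);    // get+set: slower than int[]++
    }
}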
n-plane-z counters

(Diagram: counter bit planes a-d — the Platonic ideal vs. the harsh reality.)

Roughly: the bits of each count are spread across the planes, with an overflow marker per plane; a counter only takes up room in a higher plane if it overflows the planes below, which is what keeps the long tail cheap. The count L is encoded across the planes as:

L:  0 ≣ 000000
L:  1 ≣ 000001
L:  2 ≣ 000011
L:  3 ≣ 000101
L:  4 ≣ 000111
L:  5 ≣ 001001
L:  6 ≣ 001011
L:  7 ≣ 001101
...
L: 12 ≣ 010111
Comparison of counter structures

Field     int[ordinals]   PackedInts(ordinals, maxBPV)   n-plane-z
domain        4 MB             3 MB (72%)                  1 MB (30%)
url         780 MB           420 MB (53%)                 66 MB ( 8%)
links      2350 MB          1760 MB (75%)                311 MB (13%)
Speed comparison
I could go on about
• Threaded counting
• Heuristic faceting
• Fine count skipping
• Counter capping
• Monotonically increasing tracker for n-plane-z
• Regexp filtering
What about huge result sets?
• Rare for explorative term-based searches
• Common for batch extractions
• Threading works poorly as #shards > #CPUs
• But how bad is it really?
Really bad! 8 minutes
Heuristic faceting
• Use sampling to guess the top-X terms (see the sketch below)
  - Re-use the existing tracked counters
  - 1:1000 sampling seems usable for the links field, which has 5 billion references per shard
• Fine-count the guessed terms
Over-provisioning helps validity
10 seconds < 8 minutes
Web scale for a small web
• Denmark
  - Consolidation circa 10th century
  - 5.6 million people
• Danish Net Archive (http://netarkivet.dk)
  - Constitution 2005
  - 20 billion items / 590TB+ raw data
Never enough time, but talk to me about
• Threaded counting
• Monotonically increasing tracker for n-plane-z
• Regexp filtering
• Fine count skipping
• Counter capping
