OCTOBER 13-16, 2015 • AUSTIN, TX
Faceting optimizations for Solr
Toke Eskildsen
Search Engineer / Solr Hacker
State and University Library, Denmark
@TokeEskildsen / te@statsbiblioteket.dk
Overview
• Web scale at the State and University Library, Denmark
• Field faceting 101
• Optimizations
  - Reuse
  - Tracking
  - Caching
  - Alternative counters
Web scale for a small web
• Denmark
  - Consolidation circa 10th century
  - 5.6 million people
• Danish Net Archive (http://netarkivet.dk)
  - Constitution 2005
  - 20 billion items / 590TB+ raw data
Indexing 20 billion web items / 590TB into Solr
• Solr index size is 1/9th of the raw data = 70TB
• Each shard holds 200M documents / 900GB
  - Shards are built chronologically by a dedicated machine
  - Projected 80 shards
  - Current build time per shard: 4 days
  - Total build time is 20 CPU-core years
  - So far only 7.4 billion documents / 27TB in the index
Searching a 7.4 billion documents / 27TB Solr index
• SolrCloud with 2 machines, each having
  - 16 HT-cores, 256GB RAM, 25 * 930GB SSD
  - 25 shards @ 900GB
  - 1 Solr instance per shard per SSD, Xmx=8g, Solr 4.10
  - Disk cache 100GB, or < 1% of index size
String faceting 101 (single shard)
counter = new int[ordinals]
for docID: result.getDocIDs()                 // iterate the result set
    for ordinal: getOrdinals(docID)           // term ordinals for the document
        counter[ordinal]++
for ordinal = 0 ; ordinal < counter.length ; ordinal++
    priorityQueue.add(ordinal, counter[ordinal])   // keeps only the top-X entries
for entry: priorityQueue
    result.add(resolveTerm(entry.ordinal), entry.count)
ord  term  counter
  0  A        0
  1  B        3
  2  C        0
  3  D     1006
  4  E        1
  5  F        1
  6  G        0
  7  H        0
  8  I        3
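
To make the counting phase concrete, here is a minimal self-contained Java sketch of the algorithm above. OrdinalSource and its methods are hypothetical stand-ins for the DocValues lookups, not Solr API:

import java.util.PriorityQueue;

// Single-shard String faceting, as in the pseudocode: count per ordinal,
// keep the top-X in a size-bounded min-heap, then resolve ordinals to terms.
public class FacetCount101 {
    // Hypothetical stand-in for Lucene DocValues access.
    interface OrdinalSource {
        int[] ordinals(int docID);   // term ordinals for a document
        String term(int ordinal);    // resolve an ordinal back to its term
    }

    static String[] topX(int[] docIDs, OrdinalSource src, int ordinalCount, int topX) {
        int[] counter = new int[ordinalCount];            // one slot per unique term
        for (int docID : docIDs) {
            for (int ordinal : src.ordinals(docID)) {
                counter[ordinal]++;
            }
        }
        // Min-heap of {ordinal, count}: the smallest of the current top-X is evicted first.
        PriorityQueue<int[]> pq = new PriorityQueue<>((a, b) -> Integer.compare(a[1], b[1]));
        for (int ord = 0; ord < counter.length; ord++) {
            if (pq.size() < topX) {
                pq.add(new int[]{ord, counter[ord]});
            } else if (counter[ord] > pq.peek()[1]) {
                pq.poll();
                pq.add(new int[]{ord, counter[ord]});
            }
        }
        String[] result = new String[pq.size()];
        for (int i = result.length - 1; i >= 0; i--) {    // drain heap: smallest first
            int[] entry = pq.poll();
            result[i] = src.term(entry[0]) + " (" + entry[1] + ")";
        }
        return result;
    }
}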
Test setup 1 (easy start)
• Solr setup
  - 16 HT-cores, 256GB RAM, SSD
  - Single shard, 250M documents / 900GB
• URL field
  - Single String value
  - 200M unique terms
• 3 concurrent “users”
• Random search terms
Vanilla Solr, single shard, 250M documents, 200M values, 3 users
Allocating and dereferencing 800MB arrays
Reuse the counter
counter = new int[ordinals]
for docID: result.getDocIDs()
    for ordinal: getOrdinals(docID)
        counter[ordinal]++
for ordinal = 0 ; ordinal < counter.length ; ordinal++
    priorityQueue.add(ordinal, counter[ordinal])
<counter is no longer referenced and will be garbage collected at some point>
Reuse the counter
counter = pool.getCounter()
for docID: result.getDocIDs()
    for ordinal: getOrdinals(docID)
        counter[ordinal]++
for ordinal = 0 ; ordinal < counter.length ; ordinal++
    priorityQueue.add(ordinal, counter[ordinal])
pool.release(counter)
Note: The JSON Facet API in Solr 5 already supports reuse of counters.
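
A minimal sketch of such a pool, assuming fixed-size int[] counters; CounterPool, getCounter and release mirror the pseudocode and are not Solr's actual classes:

import java.util.Arrays;
import java.util.concurrent.ConcurrentLinkedQueue;

// Recycles counter arrays instead of allocating (and garbage collecting)
// a fresh 800MB int[] for every facet call.
public class CounterPool {
    private final int size;
    private final ConcurrentLinkedQueue<int[]> free = new ConcurrentLinkedQueue<>();

    public CounterPool(int size) { this.size = size; }

    public int[] getCounter() {
        int[] counter = free.poll();
        return counter != null ? counter : new int[size];
    }

    public void release(int[] counter) {
        Arrays.fill(counter, 0);   // clearing costs too, but far less than allocate + GC
        free.add(counter);
    }
}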
Using and clearing 800MB arrays
Reusing counters vs. not doing so
Reusing counters, now with readable visualization
Why does it always take more than 500ms?
Iteration is not free
counter = pool.getCounter()
for docID: result.getDocIDs()
    for ordinal: getOrdinals(docID)
        counter[ordinal]++
for ordinal = 0 ; ordinal < counter.length ; ordinal++   // visits every counter, hit or not
    priorityQueue.add(ordinal, counter[ordinal])
pool.release(counter)
200M unique terms = 800MB to iterate (and to clear on release), no matter how small the result set.
Tracking updated counters

A second array, the tracker, records which ordinals have been updated, so only those need to be visited afterwards. Starting from a cleared counter for ordinals 0-8:

counter: [0, 0, 0, 0, 0, 0, 0, 0, 0]      tracker: []

counter[3]++
counter: [0, 0, 0, 1, 0, 0, 0, 0, 0]      tracker: [3]

counter[1]++
counter: [0, 1, 0, 1, 0, 0, 0, 0, 0]      tracker: [3, 1]

counter[1]++ counter[1]++
counter: [0, 3, 0, 1, 0, 0, 0, 0, 0]      tracker: [3, 1]    (ordinal 1 is already tracked)

counter[8]++ counter[8]++ counter[4]++ counter[8]++ counter[5]++ … (many more increments)
counter: [0, 3, 0, 1006, 1, 1, 0, 0, 3]   tracker: [3, 1, 8, 4, 5]
Tracking updated counters
counter = pool.getCounter()
for docID: result.getDocIDs()
    for ordinal: getOrdinals(docID)
        if counter[ordinal]++ == 0 && tracked < maxTracked
            tracker[tracked++] = ordinal
if tracked < maxTracked                      // tracker did not overflow: visit only touched ordinals
    for i = 0 ; i < tracked ; i++
        priorityQueue.add(tracker[i], counter[tracker[i]])
else                                         // tracker overflowed: fall back to full iteration
    for ordinal = 0 ; ordinal < counter.length ; ordinal++
        priorityQueue.add(ordinal, counter[ordinal])
The final state from the walkthrough above:

counter: [0, 3, 0, 1006, 1, 1, 0, 0, 3]    tracker: [3, 1, 8, 4, 5]
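
A compact Java sketch of the tracked counter described by the pseudocode above; maxTracked bounds the tracker, and the caller falls back to full iteration when overflowed() reports true:

// Tracked counter: remembers up to maxTracked touched ordinals so the
// top-X selection can skip the untouched (zero) entries.
public class TrackedCounter {
    final int[] counter;   // one count per unique term ordinal
    final int[] tracker;   // ordinals touched so far, in first-hit order
    int tracked = 0;

    TrackedCounter(int ordinals, int maxTracked) {
        counter = new int[ordinals];
        tracker = new int[maxTracked];
    }

    void increment(int ordinal) {
        // First increment for this ordinal: remember it, unless the tracker is full.
        if (counter[ordinal]++ == 0 && tracked < tracker.length) {
            tracker[tracked++] = ordinal;
        }
    }

    // When true, too many distinct ordinals were hit: iterate the full counter instead.
    boolean overflowed() { return tracked >= tracker.length; }
}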
Distributed faceting
Phase 1) All shards perform faceting. The merger calculates the top-X terms.
Phase 2) The term counts are requested from the shards that did not return them in phase 1. The merger calculates the final counts for the top-X terms.

for term: fineCountRequest.getTerms()
    result.add(term,
        searcher.numDocs(query(field:term), base.getDocIDs()))
Test setup 2 (more shards, smaller field)
• Solr setup
  - 16 HT-cores, 256GB RAM, SSD
  - 9 shards @ 250M documents / 900GB
• domain field
  - Single String value
  - 1.1M unique terms per shard
• 1 concurrent “user”
• Random search terms
Pit of Pain™ (or maybe “Horrible Hill”?)
Fine counting can be slow
Phase 1: standard faceting
Phase 2:
for term: fineCountRequest.getTerms()
    result.add(term,
        searcher.numDocs(query(field:term), base.getDocIDs()))   // one query + doc set intersection per requested term
Alternative fine counting
counter = pool.getCounter()
for docID: result.getDocIDs()
    for ordinal: getOrdinals(docID)
        counter.increment(ordinal)
for term: fineCountRequest.getTerms()
    result.add(term, counter.get(getOrdinal(term)))

The counting loop is the same as in phase 1, which yields:

counter: [0, 3, 0, 1006, 1, 1, 0, 0, 3]
Using cached counters from phase 1 in phase 2
counter = pool.getCounter(key)
for term: query.getTerms()
    result.add(term, counter.get(getOrdinal(term)))
pool.release(counter)
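
A sketch of the keyed pool, under the assumption that the key identifies the query and that a simple LRU policy evicts old counters; this mirrors the pseudocode, not Solr's implementation:

import java.util.LinkedHashMap;
import java.util.Map;

// Keeps filled counters keyed by query, so phase 2 can reuse the
// counts from phase 1 instead of issuing per-term queries.
public class CachingCounterPool {
    private final int size;
    private final int maxCached;
    private final LinkedHashMap<String, int[]> cache =
            new LinkedHashMap<String, int[]>(16, 0.75f, true) {
                @Override protected boolean removeEldestEntry(Map.Entry<String, int[]> eldest) {
                    return size() > maxCached;   // simple LRU eviction
                }
            };

    public CachingCounterPool(int size, int maxCached) {
        this.size = size;
        this.maxCached = maxCached;
    }

    // Phase 1: unknown key, hand out a fresh counter and remember it.
    // Phase 2: same key, hand back the already filled counter.
    public synchronized int[] getCounter(String key) {
        int[] counter = cache.get(key);
        if (counter == null) {
            counter = new int[size];
            cache.put(key, counter);
        }
        return counter;
    }

    public synchronized void release(int[] counter) {
        // Intentionally does not clear: the contents stay cached for a
        // possible phase 2, until the LRU policy evicts the entry.
    }
}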
Pit of Pain™ practically eliminated
Stick figure CC BY-NC 2.5 Randall Munroe, xkcd.com
Test setup 3 (more shards, more fields)
• Solr setup
  - 16 HT-cores, 256GB RAM, SSD
  - 23 shards @ 250M documents / 900GB
• Faceting on 6 fields
  - url: ~200M unique terms / shard
  - domain & host: ~1M unique terms each / shard
  - type, suffix, year: < 1000 unique terms / shard
1 machine, 7 billion documents / 23TB total index, 6 facet fields
High-cardinality can mean different things
Single shard / 250,000,000 docs / 900GB

Field    References      Max docs/term   Unique terms
domain     250,000,000       3,000,000       1,100,000
url        250,000,000          56,000     200,000,000
links    5,800,000,000       5,000,000     610,000,000

An int[] counter for links alone: 610M ordinals × 4 bytes = 2440 MB per counter.
Different distributions

(Graph: doc-frequency distributions for domain (1.1M terms), url (200M) and links (600M) — domain has a high max count and a short tail, url a low max and a very long tail.)
Theoretical lower limit per counter: log2(max_count) bits
A counter that never exceeds max_count needs only ceil(log2(max_count + 1)) bits: max=1 fits in 1 bit, max=3 in 2 bits, max=7 in 3, max=63 in 6, max=2047 in 11.
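
As a worked example with the numbers from the cardinality table above (a sketch, not part of the talk's code):

public class CounterBits {
    // Minimum bits per counter: ceil(log2(maxCount + 1)).
    static int bitsRequired(long maxCount) {
        return 64 - Long.numberOfLeadingZeros(maxCount);
    }

    public static void main(String[] args) {
        System.out.println(bitsRequired(3_000_000));   // 22 bits for domain (vs. 32 in an int[])
        System.out.println(bitsRequired(56_000));      // 16 bits for url
        System.out.println(bitsRequired(5_000_000));   // 23 bits for links
    }
}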
int vs. PackedInts

Field     int[ordinals]   PackedInts(ordinals, maxBPV)
domain        4 MB             3 MB (72%)
url         780 MB           420 MB (53%)
links      2350 MB          1760 MB (75%)
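
Lucene ships such a packed structure; a minimal sketch of allocating and incrementing one with the Lucene 4.x PackedInts API (ordinals and maxDocsPerTerm taken from the table above):

import org.apache.lucene.util.packed.PackedInts;

public class PackedCounter {
    // One entry per unique term, each just wide enough for the field's max count.
    static PackedInts.Mutable newCounter(int ordinals, long maxDocsPerTerm) {
        return PackedInts.getMutable(
                ordinals,
                PackedInts.bitsRequired(maxDocsPerTerm),   // e.g. 23 bits for links
                PackedInts.COMPACT);                       // favor memory over speed
    }

    static void increment(PackedInts.Mutable counter, int ordinal) {
        counter.set(ordinal, counter.get(ordinal) + 1);    // get+set: slower than int[]++
    }
}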
n-plane-z counters

(Diagram: counter bit planes a-d — the Platonic ideal vs. the harsh reality.)

Roughly: the bits of each count are spread across the planes, with an overflow marker per plane; a counter only takes up room in a higher plane if it overflows the planes below, which is what keeps the long tail cheap. The count L is encoded across the planes as:

L:  0 ≣ 000000
L:  1 ≣ 000001
L:  2 ≣ 000011
L:  3 ≣ 000101
L:  4 ≣ 000111
L:  5 ≣ 001001
L:  6 ≣ 001011
L:  7 ≣ 001101
...
L: 12 ≣ 010111
Comparison of counter structures

Field     int[ordinals]   PackedInts(ordinals, maxBPV)   n-plane-z
domain        4 MB             3 MB (72%)                  1 MB (30%)
url         780 MB           420 MB (53%)                 66 MB ( 8%)
links      2350 MB          1760 MB (75%)                311 MB (13%)
Speed comparison
I could go on about
• Threaded counting
• Heuristic faceting
• Fine count skipping
• Counter capping
• Monotonically increasing tracker for n-plane-z
• Regexp filtering
What about huge result sets?
• Rare for explorative term-based searches
• Common for batch extractions
• Threading works poorly as #shards > #CPUs
• But how bad is it really?
Really bad! 8 minutes
Heuristic faceting
• Use sampling to guess the top-X terms (see the sketch below)
  - Re-use the existing tracked counters
  - 1:1000 sampling seems usable for the links field, which has 5 billion references per shard
• Fine-count the guessed terms
Over-provisioning helps validity
10 seconds < 8 minutes
Web scale for a small web
• Denmark
  - Consolidation circa 10th century
  - 5.6 million people
• Danish Net Archive (http://netarkivet.dk)
  - Constitution 2005
  - 20 billion items / 590TB+ raw data
Never enough time, but talk to me about
• Threaded counting
• Monotonically increasing tracker for n-plane-z
• Regexp filtering
• Fine count skipping
• Counter capping
