Solr sparse faceting

1/47
Solr sparse faceting
Everything counts in large amounts
@TokeEskildsen
State and University Library, Denmark
https://tokee.github.io/lucene-solr/

3/47
Nothing To Fear
● 500TB+ web resources from Danish Net Archive
● Estimated 50TB Solr index data when finished
● 3 machines of 16 CPU cores, 256GB RAM, 25 * 900GB SSD
– Each machine holds: 25 Solrs
● Each Solr holds: 1 optimized shard with 900GB / 250M docs
● Shards built externally, one at a time
● (Optimizations also relevant for smaller setups)

4/47
Pipeline
counter = new int[ordinals]
for docID: result.getDocIDs()
    for ordinal: getOrdinals(docID)
        counter[ordinal]++
for ordinal = 0 ; ordinal < counter.length ; ordinal++
    priorityQueue.add(ordinal, counter[ordinal])

ord  term  counter
0    A     0
1    B     3
2    C     0
3    D     1006
4    E     1
5    F     1
6    G     0
7    H     0
8    I     3
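The pipeline above can be sketched in Java as follows. This is a minimal illustration, not the Solr implementation: the class name `SimpleFacetCounter`, the `docOrdinals` array standing in for `getOrdinals(docID)`, and the bounded min-heap used for the top-N extraction are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper, not part of Solr: plain counting plus a full scan.
class SimpleFacetCounter {
    /** Counts how often each ordinal occurs; docOrdinals[d] holds the ordinals of hit d. */
    static int[] count(int[][] docOrdinals, int ordinalCount) {
        int[] counter = new int[ordinalCount];
        for (int[] ordinals : docOrdinals) {
            for (int ordinal : ordinals) {
                counter[ordinal]++;
            }
        }
        return counter;
    }

    /** Scans every counter and returns the ordinals of the topN largest non-zero counts, largest first. */
    static List<Integer> topOrdinals(int[] counter, int topN) {
        // Min-heap on count, so the weakest candidate is evicted first.
        PriorityQueue<Integer> queue =
                new PriorityQueue<>(Comparator.comparingInt((Integer ord) -> counter[ord]));
        for (int ordinal = 0; ordinal < counter.length; ordinal++) {
            if (counter[ordinal] == 0) continue;
            queue.add(ordinal);
            if (queue.size() > topN) queue.poll();
        }
        List<Integer> result = new ArrayList<>();
        while (!queue.isEmpty()) result.add(0, queue.poll());
        return result;
    }
}
```

Note that `topOrdinals` visits every entry of `counter`, even though only five of the nine entries in the table are non-zero. With hundreds of millions of ordinals, that scan is the cost the later slides address.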

7/47
Recycle
counter = pool.getCounter()
counter[ordinal]++
pool.release(counter)
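A minimal sketch of such a pool, assuming the goal is to avoid allocating a fresh counter array for every request. The class `CounterPool` and its clearing strategy (zeroing on release) are assumptions for illustration; the slide only names `getCounter` and `release`.

```java
import java.util.ArrayDeque;
import java.util.Arrays;

// Hypothetical pool, not the Solr implementation.
class CounterPool {
    private final int ordinalCount;
    private final ArrayDeque<int[]> free = new ArrayDeque<>();

    CounterPool(int ordinalCount) { this.ordinalCount = ordinalCount; }

    /** Hands out a zeroed counter, reusing a released one when available. */
    synchronized int[] getCounter() {
        int[] counter = free.poll();
        return counter != null ? counter : new int[ordinalCount];
    }

    /** Clears the counter and keeps it for the next request instead of leaving it to the GC. */
    void release(int[] counter) {
        Arrays.fill(counter, 0);
        synchronized (this) { free.push(counter); }
    }
}
```

Clearing still touches every entry here; it only moves that work out of the allocation path.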

11/47
Counting
counter[ordinal]++
pool.release(counter)
ord  term  counter
0    A     0
1    B     3
2    C     0
3    D     1006
4    E     1
5    F     1
6    G     0
7    H     0
8    I     3

12/47
ord  counter  tracker
0    0        N/A
1    0        N/A
2    0        N/A
3    0        N/A
4    0        N/A
5    0        N/A
6    0        N/A
7    0        N/A
8    0        N/A

13/47
ord  counter  tracker
0    0        3
1    0        N/A
2    0        N/A
3    1        N/A
4    0        N/A
5    0        N/A
6    0        N/A
7    0        N/A
8    0        N/A

14/47
ord  counter  tracker
0    0        3
1    1        1
2    0        N/A
3    1        N/A
4    0        N/A
5    0        N/A
6    0        N/A
7    0        N/A
8    0        N/A

15/47
ord  counter  tracker
0    0        3
1    1        1
2    0        N/A
3    2        N/A
4    0        N/A
5    0        N/A
6    0        N/A
7    0        N/A
8    0        N/A

16/47
ord  counter  tracker
0    0        3
1    3        1
2    0        8
3    1006     4
4    1        5
5    1        N/A
6    0        N/A
7    0        N/A
8    3        N/A

17/47
Sparse counting
if counter[ordinal]++ == 0 && tracked < maxTracked
    tracker[tracked++] = ordinal

if tracked < maxTracked
    for i = 0 ; i < tracked ; i++
        priorityQueue.add(tracker[i], counter[tracker[i]])
else
    for ordinal = 0 ; ordinal < counter.length ; ordinal++
        priorityQueue.add(ordinal, counter[ordinal])
ord  counter  tracker
0    0        3
1    3        1
2    0        8
3    1006     4
4    1        5
5    1        N/A
6    0        N/A
7    0        N/A
8    3        N/A
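The tracker logic can be sketched in Java like this. It is a minimal illustration under stated assumptions: the class `SparseCounter` and the methods `isSparse` and `nonZeroOrdinals` are hypothetical names, and collecting ordinals into a list stands in for feeding the priority queue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of counting with a tracker of touched ordinals.
class SparseCounter {
    final int[] counter;
    final int[] tracker;   // ordinals whose count went from 0 to 1, in order of first touch
    int tracked = 0;

    SparseCounter(int ordinalCount, int maxTracked) {
        counter = new int[ordinalCount];
        tracker = new int[maxTracked];
    }

    void increment(int ordinal) {
        if (counter[ordinal]++ == 0 && tracked < tracker.length) {
            tracker[tracked++] = ordinal;
        }
    }

    /** True when the tracker is known to hold every non-zero ordinal. */
    boolean isSparse() { return tracked < tracker.length; }

    /** Returns the non-zero ordinals, via the tracker when sparse, else by a full scan. */
    List<Integer> nonZeroOrdinals() {
        List<Integer> result = new ArrayList<>();
        if (isSparse()) {
            for (int i = 0; i < tracked; i++) result.add(tracker[i]);
        } else {
            for (int ordinal = 0; ordinal < counter.length; ordinal++) {
                if (counter[ordinal] != 0) result.add(ordinal);
            }
        }
        return result;
    }
}
```

Incrementing ordinals 3, 1, 3, 8, 4, 5 in that order leaves the tracker holding 3, 1, 8, 4, 5, the same sequence as in the table above. A full tracker is treated as overflowed because it can no longer be told apart from one that dropped ordinals, so the code falls back to the full scan.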

19/47
Get the Balance Right
Phase 1) All shards perform faceting
         The Merger calculates top-X terms
Phase 2) The term counts are requested from the shards
         that did not return them in phase 1
for term: query.getTerms()
    result.add(term, searcher.numDocs(
        query(field:term), base.getDocIDs()
    ).hitCount)

21/47
Alternative fine counting
counter.increment(ordinal)    // same as phase 1
result.add(term, counter.get(getOrdinal(term)))

22/47
Stripped
counter = pool.getCounter(key)
result.add(term, counter.get(getOrdinal(term)))
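One way to read this slide is that the counter filled in phase 1 is kept under a key identifying the request, so phase 2 becomes a lookup per term. The sketch below assumes that reading. `FineCounter`, `keep`, and the `HashMap` cache are hypothetical, and `getOrdinal` is approximated with a binary search over a sorted term array.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: answering phase 2 from a counter kept since phase 1.
class FineCounter {
    private final String[] sortedTerms;                         // ordinal -> term, sorted
    private final Map<String, int[]> filled = new HashMap<>();  // key -> phase 1 counter

    FineCounter(String[] sortedTerms) { this.sortedTerms = sortedTerms; }

    /** Phase 1: remember the filled counter under a key identifying the request. */
    void keep(String key, int[] counter) { filled.put(key, counter); }

    /** Stand-in for getOrdinal(term): binary search in the sorted term list. */
    int getOrdinal(String term) { return Arrays.binarySearch(sortedTerms, term); }

    /** Phase 2: answer the requested terms from the cached counter, or null if it is gone. */
    Map<String, Integer> fineCount(String key, String[] terms) {
        int[] counter = filled.get(key);
        if (counter == null) return null;  // caller must fall back to recounting
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String term : terms) {
            int ordinal = getOrdinal(term);
            result.put(term, ordinal < 0 ? 0 : counter[ordinal]);
        }
        return result;
    }
}
```

A real cache would need eviction, since each retained counter costs as much memory as a live one.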

24/47
250,000,000 docs / 900GB, optimized
Field   References     Max docs/term  Terms
domain  250,000,000    3,000,000      1,100,000
url     250,000,000    56,000         200,000,000
links   5,800,000,000  5,000,000      610,000,000

25/47
Term distributions
domain 1.1M url 200M links 600M

26/47
Clean

27/47
World Full Of Nothing
Field   int[ordinals]  PackedInts(ordinals, maxBPV)
domain  4 MB           3 MB (72%)
url     780 MB         420 MB (53%)
links   1760 MB is the PackedInts size; see table rows below
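The arithmetic behind these figures can be approximated with a few lines of Java. This is a back-of-the-envelope sketch, not Lucene's `PackedInts` accounting: `CounterMemory` and its methods are hypothetical, and the rounded numbers from the earlier field table will not reproduce the slide's values exactly.

```java
// Hypothetical estimator for counter memory; ignores object headers and block alignment.
class CounterMemory {
    /** Bits needed to represent values 0..maxValue (at least 1). */
    static int bitsRequired(long maxValue) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }

    /** Bytes used by a plain int[] with one 32-bit counter per ordinal. */
    static long intArrayBytes(long ordinals) {
        return ordinals * 4L;
    }

    /** Bytes used when every counter is stored with the same bitsPerValue, rounded up to whole bytes. */
    static long packedBytes(long ordinals, int bitsPerValue) {
        return (ordinals * bitsPerValue + 7) / 8;
    }
}
```

For the domain field, `intArrayBytes(1_100_000)` gives 4,400,000 bytes, in line with the 4 MB shown. The percentages on the slide (72%, 53%, 75%) equal 23, 17 and 24 bits out of 32, one bit more than `bitsRequired` returns for the rounded maxima 3,000,000, 56,000 and 5,000,000, so the exact maxBPV used cannot be recovered from the rounded table.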

28/47
Construction Time Again
Platonic ideal    Harsh reality
Plane 4
Plane 3
Plane 2
Plane 1

29/47
Plane 4
Plane 3
Plane 2
Plane 1
L: 0 ≣ 000000

30/47
Plane 3
Plane 2
Plane 1
L: 0 ≣ 000000
L: 1 ≣ 000001

31/47
Plane 3
Plane 2
Plane 1
L: 0 ≣ 000000
L: 1 ≣ 000001
L: 2 ≣ 000010

32/47
Plane 3
Plane 2
Plane 1
L: 0 ≣ 000000
L: 1 ≣ 000001
L: 2 ≣ 000010
L: 3 ≣ 000011

33/47
Plane 3
Plane 2
Plane 1
L: 0 ≣ 000000
L: 1 ≣ 000001
L: 2 ≣ 000010
L: 3 ≣ 000011
L: 4 ≣ 000100
L: 5 ≣ 000101
L: 6 ≣ 000110
L: 7 ≣ 000111
...
L: 12 ≣ 001100

34/47
Plane 3
Plane 2
Plane 1
L: 0 ≣ 000000
L: 1 ≣ 000001
L: 2 ≣ 000010
L: 3 ≣ 000011
L: 4 ≣ 000100
L: 5 ≣ 000101
L: 6 ≣ 000110
L: 7 ≣ 000111
...
L: 12 ≣ 001100
if counter[ordinal]++ == 0 && tracked < maxTracked
    tracker[tracked++] = ordinal
?

35/47
Now This is Fun
Plane 1
Plane 2
Plane 3
Plane 4

36/47
Plane 1
Plane 2
Plane 3
Plane 4
L: 0 ≣ 000000

37/47
Plane 1
Plane 2
Plane 3
Plane 4
L: 0 ≣ 000000
L: 1 ≣ 000001

38/47
Plane 1
Plane 2
Plane 3
Plane 4
L: 0 ≣ 000000
L: 1 ≣ 000001
L: 2 ≣ 000011

39/47
Plane 1
Plane 2
Plane 3
Plane 4
L: 0 ≣ 000000
L: 1 ≣ 000001
L: 2 ≣ 000011
L: 3 ≣ 000101

40/47
Plane 1
Plane 2
Plane 3
Plane 4
L: 0 ≣ 000000
L: 1 ≣ 000001
L: 2 ≣ 000011
L: 3 ≣ 000101
L: 4 ≣ 000111
L: 5 ≣ 001001
L: 6 ≣ 001011
L: 7 ≣ 001101
...
L: 12 ≣ 010111
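The bit patterns listed on these slides are consistent with an encoding in which the lowest bit flags "non-zero" and the remaining bits hold the count minus one. The Java sketch below is inferred only from those patterns. The class `ZEncoding` and its method names are hypothetical, and the sketch treats each counter as a single integer instead of bits spread over planes.

```java
// Hypothetical sketch of the encoding implied by the slide's bit patterns.
class ZEncoding {
    /** Encodes a count so the lowest bit is set exactly when the count is non-zero. */
    static int encode(int count) {
        return count == 0 ? 0 : ((count - 1) << 1) | 1;
    }

    /** Inverse of encode. */
    static int decode(int bits) {
        return bits == 0 ? 0 : (bits >>> 1) + 1;
    }

    /** Zero check that only inspects the lowest bit. */
    static boolean isZero(int bits) {
        return (bits & 1) == 0;
    }

    /** Six-digit binary rendering, matching the slide's notation. */
    static String toBits(int bits) {
        String s = Integer.toBinaryString(bits);
        return "000000".substring(Math.min(6, s.length())) + s;
    }
}
```

With this encoding, the `counter[ordinal]++ == 0` test used for tracking needs one bit per counter and no read across all planes, which is presumably what the question mark on the earlier slide refers to.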

41/47
The Bottom Line
Field   int[ordinals]  PackedInts(ordinals, maxBPV)  N-plane-z
domain  4 MB           3 MB (72%)                    1 MB (30%)
url     780 MB         420 MB (53%)                  66 MB ( 8%)
links   2350 MB        1760 MB (75%)                 311 MB (13%)

43/47
Kitchen sink
250,000,000 docs / 900GB, optimized
Field   References     Max docs/term  Terms
domain  250,000,000    3,000,000      1,100,000
url     250,000,000    56,000         200,000,000
links   5,800,000,000  5,000,000      610,000,000
x 9 shards
x 3 concurrent requests

44/47
Shouldn't Have Done That

45/47
Some Great Reward
8GB heap per 900GB shard

46/47
Dream On
● Threaded counting
● Monotonically increasing tracker for nplane-z
● Regexp filtering
● Fine count skipping
● Counter capping
