Encoding survey

Encoding = (Data Structures) - (Data)
Rajeev Raman
University of Leicester
SPIRE 2015, King’s College London

RMQ problem
Problem Statement
Given a static array A[1..n], pre-process A to answer queries:
RMQ(l, r): return max_{l ≤ i ≤ r} A[i].
43 97 46 85 67 18 4524 8347 33 34
RMQ(5, 10) = 85.

Data Structuring Problems
This is a data structuring problem.
• Pre-process input data (here array A) to answer a long series of
queries.
• Want to minimize:
1. Space usage of data structure.
2. Query time.
3. Time/space for pre-processing.
• In this talk we assume the input data is static, i.e. it does not change
between queries.

Solution to RMQ Problem: Cartesian Tree
The Cartesian tree of A [Vuillemin CACM’80] is a binary tree.
[Figure: the Cartesian tree of the example array, with the largest value, 97, at the root.]
• Place largest value at root of tree.
• Recurse on sub-arrays to left and right.
• RMQ(l, r) is the lowest common ancestor (LCA) of the nodes for the interval endpoints l and r.
• An n-node binary tree can support LCA in O(n) space and O(1) time.
[Harel/Tarjan SICOMP’84]
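
The construction above can be sketched in a few lines of Python. This is a minimal sketch and not the O(1)-time LCA structure of Harel and Tarjan: the query simply walks down from the root, so it takes time proportional to the tree depth, and it rebuilds the tree on every call. The array `A` is an assumption: it is one ordering of the example's twelve values that is consistent with RMQ(5, 10) = 85. The function names are illustrative only.

```python
def cartesian_tree(a):
    """Build the max-Cartesian tree of a; return (root, left, right) as index arrays."""
    n = len(a)
    left = [-1] * n
    right = [-1] * n
    stack = []  # indices on the rightmost path; values decrease from bottom to top
    for i in range(n):
        last = -1
        while stack and a[stack[-1]] < a[i]:
            last = stack.pop()
        left[i] = last            # everything popped becomes the left subtree of i
        if stack:
            right[stack[-1]] = i  # i hangs to the right of the nearest larger value
        stack.append(i)
    return stack[0], left, right


def rmq(a, l, r):
    """Return max(A[l..r]) for 1-based inclusive l, r by locating the LCA of l and r."""
    root, left, right = cartesian_tree(a)
    lo, hi = l - 1, r - 1
    node = root
    while not (lo <= node <= hi):  # the first node inside [lo, hi] is the LCA
        node = left[node] if hi < node else right[node]
    return a[node]


# Hypothetical ordering of the example values (an assumption, see above).
A = [47, 43, 97, 46, 67, 18, 45, 85, 34, 24, 83, 33]
print(rmq(A, 5, 10))  # 85
```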

Compressing RMQ
• O(n) space = O(n) words = Ω(n lg n) bits.¹
• Many applications where using O(n) words is way too much.
• Suffix tree on a string of n bits occupies O(n) words
• The same is true for many applications of RMQ.
• Can reconstruct A by asking RMQ(i, i) queries.
• In general A can’t be compressed below Ω(n lg n) bits.
• In specific applications (e.g. LCP array), A can be compressed, but
then accessing A[i] is slow.
Can we do better?
¹ lg = log₂.

The RMQ Problem Redefined
Given a static array A[1..n], pre-process A to answer queries:
RMQ(l, r) = arg max_{l ≤ i ≤ r} A[i].
43 97 46 85 67 18 4524 8347 33 34
RMQ(5, 10) = 8.
Often the value of A[i] is not needed.

Encoding RMQ
RMQ(l, r) = arg max_{l ≤ i ≤ r} A[i].
[Figure: Cartesian tree of the example array, with nodes labelled by array positions 1..12; position 3 is the root.]
• The shape of the Cartesian tree is enough to answer modified RMQ queries.
• A is not necessary!
• There are ≤ 4^n distinct binary trees on n nodes.
• The shape can be encoded in ≤ lg 4^n = 2n bits.
• Concrete encoding: 11 01 00 11 11 00 10 00 11 01 00 00.
• Data structures using 2n + o(n) bits, O(1) query time.
[Fischer/Heun SICOMP’11],[Davoodi et al. COCOON’12].
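
To make the 2n-bit idea concrete, here is a minimal sketch that writes the tree shape as one 2-bit code per node and then answers arg-max queries from that string alone. Two things are assumptions rather than facts from the talk: the codes are read as (has left child, has right child) in preorder, and the array `A` is one ordering of the example values chosen to be consistent with that reading. Under those assumptions the sketch reproduces the code string shown above. The query walks the decoded tree, so it is not the O(1)-time succinct structure of the cited papers.

```python
def shape_code(a):
    """Preorder 2-bit codes: first bit = has a left child, second = has a right child."""
    codes = []

    def visit(lo, hi):  # half-open, non-empty range a[lo:hi]
        m = max(range(lo, hi), key=lambda i: a[i])  # (leftmost) maximum is the root
        codes.append(("1" if m > lo else "0") + ("1" if m + 1 < hi else "0"))
        if m > lo:
            visit(lo, m)
        if m + 1 < hi:
            visit(m + 1, hi)

    visit(0, len(a))
    return " ".join(codes)


def rmq_from_code(code, l, r):
    """Answer arg-max RMQ(l, r), 1-based, using only the code string (no array)."""
    pairs = code.split()
    nxt = 0

    def parse(offset):
        # offset = number of array positions lying to the left of this subtree
        nonlocal nxt
        has_left, has_right = pairs[nxt]
        nxt += 1
        left = parse(offset) if has_left == "1" else None
        root = offset + (left[3] if left else 0) + 1   # in-order rank = array position
        right = parse(root) if has_right == "1" else None
        size = root - offset + (right[3] if right else 0)
        return (root, left, right, size)

    node = parse(0)
    while not (l <= node[0] <= r):
        node = node[1] if r < node[0] else node[2]
    return node[0]


# Hypothetical ordering of the example values (an assumption, see above).
A = [47, 43, 97, 46, 67, 18, 45, 85, 34, 24, 83, 33]
code = shape_code(A)
print(code)                        # 11 01 00 11 11 00 10 00 11 01 00 00
print(rmq_from_code(code, 5, 10))  # 8
```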

Encoding Data Structures
[Figure: INPUT → PREPROC → Encoding; a QUERY to the encoding returns a RESULT.]
• Preprocess input data to answer a long series of queries.
• Preprocessing creates an encoding and deletes input.
Encodings = (Data Structures) − (Data)
• Queries only read encoding.
• Minimize: encoding size and query time.
• Non-trivial encodings must be smaller than original input data.

Encoding: Effective Entropy
Encoding ≡ determining effective entropy.
• Extensive literature on succinct and compressed data structures.
• Entropy: “information content of data.”
• Effective Entropy is “the information content of the data structure”
[Golin et al. TCS]:
• Given a set of objects S, a set of queries Q.
• Let C be the equivalence class on S induced by Q (x, y ∈ S are
equivalent if they cannot be distinguished by queries in Q).
A = 1 3 2, B = 2 3 1
Arrays A and B cannot be distinguished by RMQ queries.
• We want to store x in lg |C| bits.
• Can define expected effective entropy as well.
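
For small n the equivalence classes can be counted by brute force. The sketch below is only an illustration: it restricts S to permutations (an assumption, so that all values are distinct), takes Q to be all arg-max RMQ queries, and treats two arrays as equivalent when every query gives the same answer. The function names are made up for this example.

```python
from itertools import permutations


def rmq_signature(a):
    """All arg-max RMQ answers of a (0-based), as a tuple."""
    n = len(a)
    return tuple(max(range(l, r + 1), key=lambda i: a[i])
                 for l in range(n) for r in range(l, n))


def num_classes(n):
    """Number of permutations of size n that RMQ queries can tell apart."""
    return len({rmq_signature(p) for p in permutations(range(n))})


print(rmq_signature([1, 3, 2]) == rmq_signature([2, 3, 1]))  # True
print([num_classes(n) for n in range(1, 6)])                 # [1, 2, 5, 14, 42]
```

The counts are the Catalan numbers, i.e. the number of binary tree shapes on n nodes, which matches the Cartesian-tree view of RMQ.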

Overview of Talk
• Overview of recent encoding results.
• Asymptotically optimal encodings
• Range Top-k [Grossi et al. ESA’13, Gawrychowski/Nicholson
ICALP’15]
• 2D Range Maximum [Brodal et al. Algor.’12][Brodal et al. ESA’13]
• Range Majority [Navarro/Thankachan CPM’14]
• Range Selection [Navarro et al. FSTTCS’14, GN ICALP’15]
• Range Maximum Sum Query [Nicholson/Gawrychowski, CPM ’15]
• 2D NLVs [Jo et al. WALCOM’15]
• Nondirectional NLV [Nicholson/Raman, CPM ’15]
• NLV + Range Max/Min [Jo/Satti, COCOON ’15]
• Minimal encodings
• RMQs [Fischer/Heun, SICOMP’11][Davoodi et al. PTRS-A ’14]
• Range Second Maximum [Davoodi et al. PTRS-A ’14]
• Bidirectional NLVs [Fischer, TCS’11]
• Range Min-Max [Gawrychowski/Nicholson, ICALP ’15]
• 2D Range Maximum, m = 2 [Golin et al. TCS]

Encoding Nearest Larger Values (NLV)
Problem Definition
Given array A[1..n] of distinct values, encode A to answer
NLV(i): return j s.t. A[j] > A[i] and |j − i| is minimized.
9 11 2 0 1 8 56 410 7 3
NLV(6) = 3
• Can obtain NLVs in both directions from Cartesian tree:
• Unfortunately, NLVs in both directions ≡ RMQ.

Unidirectional NLVs
NLV(i): return j s.t. A[j] > A[i] and |j − i| is minimized.
• Can we modify the Cartesian tree?
• Eliminate zig-zags!
• How many binary trees with no zig-zags of degree-1 nodes?

Counting Zig-Zag Free Binary Trees [Iacono]
• Change the encoding of degree-1 nodes:
[Figure: degree-1 nodes labelled with the codes 01 and 10.]
• Any encoding is a string over A = 01, B = 10, C = 00, D = 11.
• AA does not appear in the string.
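• The number of strings of length n, S(n), satisfies:
S(n) = 3S(n − 1) + 3S(n − 2),
where the first term counts strings starting with B, C or D and the second counts strings starting with AB, AC or AD.
• Gives lg S(n) ∼ n · lg((3 + √21)/2) ∼ 1.93n < 2n bits.
• Adding forbidden patterns AB*A gets ∼ 1.8999n bits.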
• Easy to support operations.
• Same result obtained using a succinct Patricia trie, and much
optimization [Nicholson/Raman, CPM’15].
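
The recurrence is easy to check numerically. The sketch below compares it with a brute-force count of strings over {A, B, C, D} that avoid AA, and then estimates the growth rate. The base cases S(0) = 1 and S(1) = 4 are assumptions of this sketch (the empty string and the four single symbols); the function names are illustrative.

```python
from itertools import product
from math import log2, sqrt


def s_recurrence(n):
    """S(n) = 3 S(n-1) + 3 S(n-2), with assumed base cases S(0) = 1, S(1) = 4."""
    prev, cur = 1, 4
    if n == 0:
        return prev
    for _ in range(n - 1):
        prev, cur = cur, 3 * cur + 3 * prev
    return cur


def s_brute_force(n):
    """Count strings of n symbols over A, B, C, D that never contain AA."""
    return sum(1 for w in product("ABCD", repeat=n) if "AA" not in "".join(w))


for n in range(7):
    assert s_recurrence(n) == s_brute_force(n)

# Bits per symbol for long strings versus the closed-form growth rate.
print(log2(s_recurrence(400)) / 400, log2((3 + sqrt(21)) / 2))
```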

What’s the exact bound?
• Upper bound ∼ 1.89n.
• Lower bound by exhaustive enumeration ∼ 1.31n.
• Number of distinguishable configurations (equivalence classes):
n 1 2 3 4 5 6 7 8 9 10
# configurations 1 2 5 14 40 116 341 1010 3009 9012
This sequence is not in oeis.org.
• Counting up to n = 40 suggests a rate of growth of n^{O(1)} · 3^n, giving
∼ n lg 3 ≈ 1.58n bits. [Hoffmann, personal communication.]

Encoding Range Selection
Problem Definition
Given A[1..n] and κ, encode A to answer the query:
select(k, l, r): return the position of the k-th largest value in A[l..r], for
any k ≤ κ.
• Non-encoding results by many authors including [Brodal and
Jørgensen, ISAAC’09] [Jørgensen/Larsen, SODA’11],
[Chan/Wilkinson, SODA’13].
• O(n lg n) bits, O(lg k / lg lg n) time [CW SODA’13]; optimal time
for n (lg n)^{O(1)} bits of space [JL SODA’11].

Lower Bound on Encoding Size
Proposition
Any encoding for range selection must take Ω(n lg κ) bits.
Proof: The index can encode n/κ independent permutations over κ
elements ⇒ Ω((n/κ) · κ lg κ) bits = Ω(n lg κ) bits.
For example (κ = 3).
A = 3 1 2 2 3 1 1 2 3 · · ·
Can trivially recover A from its encoding.
select(2, 4, 6) = 4 ⇒ A[4] = 2.
κ must be known at construction time.
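
The lower-bound argument can be replayed directly: if A is a concatenation of permutations of 1..κ, select queries alone are enough to rebuild it. The following is a minimal sketch with a naive `select` (sorting the range), assuming n is a multiple of κ; `recover_blocks` is a hypothetical helper name.

```python
def select(a, k, l, r):
    """Position (1-based) of the k-th largest value in a[l..r], 1-based inclusive."""
    positions = sorted(range(l, r + 1), key=lambda i: a[i - 1], reverse=True)
    return positions[k - 1]


def recover_blocks(n, kappa, query):
    """Rebuild an array made of n/kappa permutations of 1..kappa from select queries only."""
    a = [0] * n
    for start in range(1, n + 1, kappa):
        for k in range(1, kappa + 1):
            pos = query(k, start, start + kappa - 1)
            a[pos - 1] = kappa - k + 1  # the k-th largest of 1..kappa is kappa - k + 1
    return a


A = [3, 1, 2, 2, 3, 1, 1, 2, 3]
print(select(A, 2, 4, 6))  # 4
print(recover_blocks(len(A), 3, lambda k, l, r: select(A, k, l, r)) == A)  # True
```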

Encoding Range Selection [GN ’15]
Consider the 1-sided case: all queries of the form select(k, l, n). The example
assumes κ = 3.
A:             0 9 3 4 2 5 6 8 1
counts:        8 0 4 3 3 2 1 0 0
capped counts: 3 0 3 3 3 2 1 0 0
• For each i, count the number of values to its right that are greater.
• Cap all counts at κ.
• Claim: we know the sorted order among all positions with counts < κ.
• Positions with count = κ are never the answer to a select(k, l, n) query.
• We can answer select(k, l, n) queries using these counts, which
occupy n lg(κ + 1) bits.
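
The claim can be tested on the example. The sketch below computes the capped counts and then answers 1-sided queries from the counts alone, by scanning right to left and inserting each uncapped position at the rank given by its count. It is an illustration under the stated definitions, not the data structure from the paper; it stores a table per left endpoint, which is far from space-efficient. Names are illustrative.

```python
def capped_counts(a, kappa):
    """For each position, the number of larger values to its right, capped at kappa."""
    n = len(a)
    return [min(kappa, sum(a[j] > a[i] for j in range(i + 1, n))) for i in range(n)]


def one_sided_tables(counts, kappa):
    """top[l] = positions (1-based) of the largest values of A[l..n], largest first,
    at most kappa of them, derived from the capped counts only."""
    n = len(counts)
    top = [None] * (n + 2)
    top[n + 1] = []
    for l in range(n, 0, -1):
        cur = list(top[l + 1])
        c = counts[l - 1]
        if c < kappa:          # exactly c values to the right are larger than A[l]
            cur.insert(c, l)
        top[l] = cur[:kappa]   # positions ranked beyond kappa can never be an answer
    return top


def select_suffix(top, k, l):
    """select(k, l, n) for k <= kappa and k <= n - l + 1."""
    return top[l][k - 1]


A = [0, 9, 3, 4, 2, 5, 6, 8, 1]
counts = capped_counts(A, 3)
print(counts)                       # [3, 0, 3, 3, 3, 2, 1, 0, 0]
top = one_sided_tables(counts, 3)
print(select_suffix(top, 2, 3))     # 7 (the value 6 is second largest in A[3..9])
```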

Extend to the general 2-sided case.
• Let S_r be the 1-sided encoding for A[1..r], for r = 1, ..., n.
• S_r answers all queries of the form select(k, l, r).
• δ_{r+1} = number of counts in the encoding S_r that are incremented to get S_{r+1}.
• Example, κ = 3, δ_10 = 3:
A:    0 9 3 4 2 5 6 8 1 7
S_10: 3 0 3 3 3 3 2 0 1 0
• Knowing δ_{r+1} suffices to get S_{r+1} from S_r.
• Z = 0^{δ_1} 1 0^{δ_2} 1 ... 0^{δ_n} 1 is an encoding of all S_1, ..., S_n.
• Z has at most κn 0s and n 1s: there are ≤ C((κ+1)n, n) distinct Z’s, where C(a, b) is the binomial coefficient.
• Encoding of size lg C((κ+1)n, n) ∼ n lg(κ + 1) + n lg e bits. This is
essentially optimal!
• Query time: O(κ^6 (log n)^{2+ε}) vs. O(log k / log log n).
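
A small sketch shows why δ_{r+1} is enough. Among positions whose count is still below κ the relative order of values is known, and the new value A[r+1] is larger than exactly the δ_{r+1} smallest of them, so those are the counts to increment. The code below builds Z from the array and then rebuilds every S_r from Z and κ only. It is an illustration with hypothetical function names and 0-based positions, and it makes no attempt at the query-time bounds quoted above.

```python
def capped_counts(a, kappa):
    """1-sided encoding: larger values to the right of each position, capped at kappa."""
    n = len(a)
    return [min(kappa, sum(a[j] > a[i] for j in range(i + 1, n))) for i in range(n)]


def encode_deltas(a, kappa):
    """delta_r = number of counts that change when going from S_{r-1} to S_r."""
    deltas, prev = [], []
    for r in range(1, len(a) + 1):
        cur = capped_counts(a[:r], kappa)
        deltas.append(sum(cur[i] != prev[i] for i in range(r - 1)))
        prev = cur
    return deltas


def z_string(deltas):
    return "".join("0" * d + "1" for d in deltas)


def decode_all(z, kappa):
    """Rebuild S_1, ..., S_n from Z and kappa alone (the array is not consulted)."""
    counts, asc, snapshots = [], [], []  # asc: uncapped positions, smallest value first
    for run in z.split("1")[:-1]:        # one run of 0s per array position
        delta = len(run)
        for p in asc[:delta]:            # the delta smallest uncapped values are exceeded
            counts[p] += 1
        asc.insert(delta, len(counts))   # the new value is larger than exactly those
        counts.append(0)
        asc = [p for p in asc if counts[p] < kappa]
        snapshots.append(list(counts))
    return snapshots


A = [0, 9, 3, 4, 2, 5, 6, 8, 1, 7]
deltas = encode_deltas(A, 3)
Z = z_string(deltas)
print(deltas[-1])             # 3
print(decode_all(Z, 3)[-1])   # [3, 0, 3, 3, 3, 3, 2, 0, 1, 0]
```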

Encoding Range Selection: Fast DS
• View A geometrically in 2D: A[i] = y ⇒ (i, y).
• Use the idea of shallow cutting for top-k [JL SODA’11].
• Take the set of n given points and decompose it into O(n/κ) slabs, each containing O(κ) points, such that:
• For any 2-sided query select(k, l, r) there exists a slab such that it and two other adjacent slabs contain the top κ elements in A[l..r].

Encoding Range Selection: Fast DS
Create a κ-shallow cutting. For the O(κ) points in each slab, store a range
selection DS: O(κ lg κ) bits per slab, or O(n lg κ) bits overall (asymptotically optimal).
1. Find resolving slab for given query [Grossi et al. ESA’13].
2. Use slab’s range selection data structure to answer query.
• Slab’s points are numbered 1..O(κ), input query and answer are in
1..n.
• Storing global coordinates of points in a slab takes O(κ lg n) bits per
slab or O(n lg n) bits overall.
3. Develop a representation of slabs which can space-efficiently:
3.1 in O(lg κ/ lg lg n) time, perform predecessor search for l and r among
x coordinates in a slab.
• Map query range to range among slab’s points.
3.2 in O(1) time, retrieve the i-th largest x-coordinate in the slab.
• Convert answer back to “global” coordinates.
Theorem [Navarro et al. FSTTCS’14]
There is an encoding that uses O(n lg κ) bits of space and supports range
selection in O(lg k / lg lg n) time.

2D NLV
Problem Statement
Given an n × n matrix A, preprocess to answer:
NLV(p): if p = (i, j), return q = (i′, j′) s.t. A[q] > A[p] and
|p − q|_1 = |i − i′| + |j − j′| is minimized.
[Figure: example grid with rows and columns indexed 0 to 5.]
If elements of A are distinct, explicitly store pointers (a pointer of length i takes
O(lg i) bits), overall O(n^2) bits. [Jayapaul et al. IWOCA’14] Jayapaul et al.
gave an O(n^2 lg lg n)-bit encoding.

Can’t point directly to the answer when elements of A are non-distinct: this
requires Ω(n^2 lg n) bits, which is uninteresting.
Jayapaul et al. gave an O(n^2 lg lg n)-bit encoding.

Encoding 2D NLV
Theorem [Jo et al. WALCOM’15]
There is an encoding of NLVs of a 2D matrix A that uses O(n^2) bits and
answers queries in O(1) time, even when elements of A are not distinct.
• Encoding idea is simple:
• Suppose wlog that NLV(p) = q is to the right and above p. If there
is a position p′ to the right of p in p’s row but not to the right of q,
then p points to p′. Else, look for p″ above p in its column. If neither
p′ nor p″ exists, then p points to q.
• 1D NLV problem closely related to RMQ problem.
• Encoding 2D-RMQ requires Ω(n^2 lg n) bits [Demaine et al.
ICALP’09].

Minimal Encodings
1. Pre-process given data to obtain encoding E, discard input.
2. E should precisely characterize the query – # distinct Es should
equal # distinguishable data instances using the query (|C|).
3. Create succinct DS on E, using lg |C|(1 + o(1)) bits. Second
pre-processing should not access input.
[Figure: INPUT → PREPROC → Encoding → PREPROC → DS; a QUERY to the DS returns a RESULT.]
Advantages
• Optimal space.
• Only information in DS is what can be obtained from queries.
• “Minimal-knowledge” data structures: contain only information
strictly necessary to answer queries.

Minimal Encodings for RMQ
Problem Definition
Given A[1..n], preprocess to answer:
RMQ(l, r): return arg max_{l ≤ i ≤ r} A[i].
[Figure: Cartesian tree of the example array, with nodes labelled by array positions 1..12.]
• Shape of Cartesian tree precisely describes all possible RMQs.
[Fischer, Heun, SICOMP’11].
• Pre-process A, output Cartesian tree, delete A.

Minimal Encodings for R2MQ
Problem Definition
Given A[1..n], encode A to answer:
R2MQ(l, r): return arg max_{i ∈ {l, ..., r} − {RMQ(l, r)}} A[i].
[Figure: extended Cartesian tree of the example, with bracketed labels on its nodes.]
• Need to merge inner spines of Cartesian tree.
• Precisely described by “extended Cartesian tree”.
• Space needed is asymptotically ∼ 2.76n bits [Gawrychowski and
Nicholson, ICALP’15].

Minimal Encodings for the Bidirectional NLV Problem
Problem Definition
BNLV(i): return j > i such that A[j] > A[i] and j − i is minimized,
and j′ < i such that A[j′] > A[i] and i − j′ is minimized.
3 7 2 4 4 8 54 34 4 3
• When A has distinct values, this is just Cartesian trees.
• When A has equal values, described by a subclass of Schröder trees [Fischer, TCS’11].
• Number of n-node Schröder trees is ≤ (3 + 2√2)^n ≈ 2^{2.54n}.
• Encoding using ≈ 2.54n bits.

Minimal Encodings for Range Min-Max Queries
Problem Definition
Range-Min-Max(l, r): return both arg max_{i ∈ {l, ..., r}} A[i] and
arg min_{i ∈ {l, ..., r}} A[i].
Minimal encoding by [Gawrychowski and Nicholson, ICALP’15]:
• Precisely characterized by Baxter permutations.
• There do not exist 1 ≤ l < i < r ≤ n such that:
π(i + 1) < π(l) < π(r) < π(i) (pattern 2-41-3)
or
π(i) < π(r) < π(l) < π(i + 1) (pattern 3-14-2)
• If A is a Baxter permutation, it can be recovered using
Range-Min-Max queries.
• Number of Baxter permutations on [n] = 2^{3n} / n^{O(1)}, giving an
encoding size of 3n − O(lg n) bits.
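
The two forbidden patterns translate directly into a brute-force test, which is enough to count Baxter permutations for small n. This is a minimal sketch (cubic time per permutation, 0-based indices, illustrative names), meant only to make the definition above concrete.

```python
from itertools import permutations


def is_baxter(p):
    """True if p avoids the vincular patterns 2-41-3 and 3-14-2."""
    n = len(p)
    for i in range(1, n - 2):          # p[i], p[i+1] form the adjacent middle pair
        for l in range(i):
            for r in range(i + 2, n):
                if p[i + 1] < p[l] < p[r] < p[i]:   # 2-41-3
                    return False
                if p[i] < p[r] < p[l] < p[i + 1]:   # 3-14-2
                    return False
    return True


def count_baxter(n):
    return sum(1 for p in permutations(range(1, n + 1)) if is_baxter(p))


print(is_baxter((2, 4, 1, 3)), is_baxter((3, 1, 4, 2)))  # False False
print([count_baxter(n) for n in range(1, 6)])            # [1, 2, 6, 22, 92]
```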

Conclusions and Open Problems
Conclusions:
• Introduced the notion of encoding DS.
• Minimal encodings are combinatorially interesting and have good
privacy properties.
Wide range of open problems:
• Challenging data structuring open problems:
• Asymptotically optimal 2D RMQ encoding of [Brodal et al. ESA’13]
does not support efficient 2D RMQ queries.
• Optimal top-k encoding of [Gawrychowski and Nicholson ICALP’15]
does not support efficient queries.
• Determining minimal encodings for a number of problems.
• Pre-processing time — ideally want O(n) time preprocessing.
• Apply encoding DS to reducing the space usage of “normal” DS. [cf.
Chan and Wilkinson, SODA’13]
