SPIRE2013-tabei20131009

20th String Processing and Information Retrieval (SPIRE2013),
Jerusalem, Israel, October 9th, 2013

Fully-Online Grammar
Compression
Yasuo Tabei (PRESTO, JST)
Collaboration with
Shirou Maruyama (PFI, Inc)
Hiroshi Sakamoto (Kyutech)
Kunihiko Sadakane (NII)

Motivation
• Large-scale and highly repetitive text collections
have become ubiquitous
– Personal genomes, version-controlled documents,
source code in repositories, reports by students

• Re-Pair = representative grammar compression method
– Not applicable to large-scale repetitive texts

• We present a scalable grammar compression method

Straight Line Program (SLP)
• Canonical form of a CFG deriving a single string
• Every production rule satisfies
– The right-hand side is a digram (a pair of symbols)
– The subscript of the left-hand symbol is larger than the
subscripts of the right-hand symbols

Example:
aabbabb

X1➝ab
X2➝X1a
X3➝X1X2
X4➝X3X2

[Figure: parse tree of the example SLP]

Straight Line Program (SLP) : notation
• N : length of the text
• n : number of production rules
• h : height of the parse tree
Grammar Compression (GC)
• Build a small CFG from an input string
– Size n = number of production rules

• Two crucial data structures
1. Dictionary : given Xk, returns XiXj for Xk ➝ XiXj
- Array A : 2nlgn bits
- Access : the right-hand side of Xk is stored in A[2k-1] and A[2k]
2. Reverse dictionary : given XiXj, returns Xk
- Hash table : O(nlgn) bits

Example rules:
X1➝ab
X2➝X1a
X3➝X1X2
X4➝X3X2
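The two data structures on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the class name `Grammar`, the method names, and the convention that terminals are one-character strings while variable Xk is the integer k are all hypothetical, and a Python list and dict stand in for the bit-packed array and hash table. The slide's 1-based positions A[2k-1], A[2k] become 0-based indices 2k-2 and 2k-1.

```python
class Grammar:
    """Toy dictionary / reverse dictionary (hypothetical names)."""

    def __init__(self):
        self.A = []    # dictionary array: rule k uses A[2k-2], A[2k-1] (0-based)
        self.rev = {}  # reverse dictionary: (left, right) -> k

    def add_rule(self, left, right):
        """Return the variable for the pair, creating a new rule if needed."""
        key = (left, right)
        if key in self.rev:
            return self.rev[key]
        self.A.extend([left, right])
        k = len(self.A) // 2
        self.rev[key] = k
        return k

    def rhs(self, k):
        """Dictionary lookup: Xk -> (Xi, Xj)."""
        return self.A[2 * k - 2], self.A[2 * k - 1]

    def expand(self, sym):
        """Derive the string of a symbol (terminals are 1-char strings)."""
        if isinstance(sym, str):
            return sym
        left, right = self.rhs(sym)
        return self.expand(left) + self.expand(right)


g = Grammar()
x1 = g.add_rule("a", "b")   # X1 -> ab
x2 = g.add_rule(x1, "a")    # X2 -> X1 a
x3 = g.add_rule(x1, x2)     # X3 -> X1 X2
x4 = g.add_rule(x3, x2)     # X4 -> X3 X2
```

Calling `g.add_rule(x1, "a")` a second time returns the existing variable instead of creating a new one, which is the role the reverse dictionary plays during compression.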

Existing grammar compression
• Compression time and working space are
important for scalability
• Online LCA (OLCA) [CCP,2011] = efficient GC
Method      Compression time  Working space (bits)
CCP,2011    O(N/α)            (3+α)nlgn
SPIRE,2012  O(N/α)            (11/4+α)nlgn
CPM,2013    O(Nlgn)           2nlgn(1+o(1))+2nlgp (p << √n)

• Drawback : they need a large working space
• Challenge : developing a fast GC with a smaller
working space

Fully-Online LCA (FOLCA)
Direct encoding of an SLP
Text : abaababa
[Figure: the SLP (parse tree) of the text, its partial parse tree, and the succinct representation]
Succinct representation
i : 1 2 3 4 5 6 7 8 9 10
B : 0 0 1 0 1 0 1 0 1 1
L : a b a X1 X2
P : 123469

• Smaller working space : (1+α)nlgn+n(3+lg(αn))
bits
• Optimal encoding : nlgn+2n+o(n) bits
– Almost equal to the lower bound [CPM,2013]
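To get a feel for these bounds, the following sketch evaluates the working-space formulas numerically and compares them with the (3+α)nlgn bits of OLCA [CCP,2011]. The function names and the sample values of n and α are hypothetical, lg is assumed to be the base-2 logarithm, and the o(n) term is dropped.

```python
from math import log2


def olca_bits(n, alpha):
    """Working space of OLCA: (3 + alpha) n lg n bits."""
    return (3 + alpha) * n * log2(n)


def folca_bits(n, alpha):
    """Working space of FOLCA: (1 + alpha) n lg n + n (3 + lg(alpha n)) bits."""
    return (1 + alpha) * n * log2(n) + n * (3 + log2(alpha * n))


def encoding_bits(n):
    """Size of the succinct encoding, leading terms only: n lg n + 2n bits."""
    return n * log2(n) + 2 * n


n, alpha = 10**6, 0.5   # hypothetical sample values
for name, bits in [("OLCA", olca_bits(n, alpha)),
                   ("FOLCA", folca_bits(n, alpha)),
                   ("encoding", encoding_bits(n))]:
    print(f"{name:9s} {bits / 8 / 2**20:6.2f} MiB")
```

For these sample values the FOLCA bound is smaller than the OLCA bound, and the encoding size is smaller than both.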

Menu
• Review of Online LCA
• FOLCA

• Compressed hash table for smaller working space
• Substring extractions
• Experiments

Basic idea of OLCA
• Replace as many identical pairs of symbols in common
substrings as possible with the same non-terminal
symbols
• Build 2-trees or 2-2-trees
[Figure: subtrees with roots X1 to X4 built over the text
a b r a k a d a b r a k a d a b r; the common substrings
receive the same non-terminal symbols]

• Repeat this procedure on the new non-terminal
symbols until a single parse tree is built

Landmark : local feature decided by
a triple of symbols ABC
• B is a landmark if it satisfies one of the
following : i) repetitive: A = B = C, ii) maximum:
A < B > C, iii) minimum: A > B < C
• Enables a bottom-up construction of a parse tree
in an online manner
• Build a parse subtree from a sequence of
four symbols
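[Figure: subtrees built over the sequence A B C D.
i) B is a landmark: a subtree with a single new node Z.
ii) Otherwise: a subtree with two new nodes Y and Z.]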
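The three landmark conditions translate directly into a predicate on a triple of symbols. The sketch below is illustrative only: the function names are hypothetical, and symbols are assumed to be comparable under some total order (here, Python's ordering on characters).

```python
def is_landmark(a, b, c):
    """True if the middle symbol b is a landmark of the triple a, b, c."""
    repetitive = a == b == c   # i)   A = B = C
    maximum = a < b > c        # ii)  A < B > C
    minimum = a > b < c        # iii) A > B < C
    return repetitive or maximum or minimum


def landmark_positions(text):
    """0-based positions whose symbol is a landmark within its triple."""
    return [i for i in range(1, len(text) - 1)
            if is_landmark(text[i - 1], text[i], text[i + 1])]
```

Because the decision depends only on a symbol and its two neighbours, equal substrings receive the same landmarks away from their boundaries, which is what lets common substrings be parsed alike.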

Online construction of a parse tree
• Use a queue corresponding to each level of a parse tree
• Read a character, build a subtree in each queue, and
enqueue a non-terminal symbol of the root to the higher
queue
[Figure: queue Qi holding q0 q1 q2 q3.
(i) q1 is a landmark: q0 q1 are dequeued and the new symbol z is enqueued to Qi+1.
(ii) Otherwise: q0 q1 q2 are dequeued, nodes y and z are built, and z is enqueued to Qi+1.]

Demonstration of OLCA
[Figure: step-by-step run of OLCA with queues Q1, Q2 and Q3]
Input string : a a a a b a b a a a a b
Rules
X1→aa
X2→ab
X3→X1X1

Courtesy of Shirou Maruyama

Efficiency of OLCA
• Approximation ratio : O(lg^2 N)
• Compression time : O(N/α)

• Working space : (3 + α)nlgn bits
• Parse tree is balanced and its height is h =
O(lgN)

Fully-Online LCA (FOLCA)
• Build a post-order partial parse tree (POPPT)
– A partial parse tree whose internal nodes carry variables numbered in post-order
[Figure: a parse tree and the corresponding POPPT]

• Enable direct encoding to a post-order
succinct tree : nlgn + 2n + o(n) bits

Online construction of POPPT
• The pair to be replaced in each queue is shifted to the
right relative to its position in OLCA
[Figure: queue Qi holding q0 q1 q2 q3 q4.
(i) q1 is a landmark: q0 q1 are dequeued and the new symbol z is enqueued to Qi+1.
(ii) Otherwise: q0 q1 q2 are dequeued, nodes y and z are built, and z is enqueued to Qi+1.]

• Approximation ratio is the same as that of OLCA

Succinct encoding of POPPT
• As FOLCA builds the POPPT in an online manner, it
encodes the POPPT into a dynamic RMM tree
[Sadakane and Navarro,2009]
– ‘0’ for a leaf and ‘1’ for an internal node
– L : a label sequence for leaves
[Figure: a POPPT and its succinct tree]
B : 0010101011
L : abaX1X2
Space : nlgn + 2n + o(n) bits
• Simulate tree operations using rank/select
dictionary : random access to Xk ➝ XiXj
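As an illustration of how B and L determine the grammar, the sketch below replays the post-order bit string with a stack: a '0' pushes the next leaf label, and a '1' pops two symbols and creates the next variable. It is not the rank/select-based access used in FOLCA, only a plain sequential decoder. The function names are hypothetical, and treating a final '1' that finds only one symbol on the stack as a virtual root marker is an assumption made here so that the slide's B can be replayed.

```python
def decode_poppt(B, L):
    """Rebuild production rules from bit string B and leaf labels L."""
    rules = {}          # k -> (left, right), i.e. Xk -> left right
    stack = []
    leaves = iter(L)
    k = 0
    for bit in B:
        if bit == "0":                  # leaf: take the next label
            stack.append(next(leaves))
        elif len(stack) >= 2:           # internal node: new post-order variable
            right = stack.pop()
            left = stack.pop()
            k += 1
            rules[k] = (left, right)
            stack.append(f"X{k}")
        # else: trailing '1' with one symbol left, assumed to be a
        # virtual root marker; it creates no rule
    return rules, stack


def expand(sym, rules):
    """Derive the string of a symbol; names starting with 'X' are variables."""
    if sym.startswith("X"):
        left, right = rules[int(sym[1:])]
        return expand(left, rules) + expand(right, rules)
    return sym


rules, roots = decode_poppt("0010101011", ["a", "b", "a", "X1", "X2"])
```

With the slide's B and L, `rules` contains four rules, and expanding the remaining root symbol yields abaababa, the text used in the FOLCA overview.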

Compression of reverse dictionary :
Given XiXj, it returns Xk for Xk➝XiXj
• Implemented as a chaining hash table
– αnlgn bits for the table, n(1+α)lgn bits for the lists (α:
load factor of the hash table)

• Observation : FOLCA generates post-order
variables in increasing order
– Variables in each list can be organized in increasing
order.

• Compress each list with gap encoding and the
delta code
• Space : (1+α)nlgn + n(3+lg(αn)) bits
• Access time : O(1/α)
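The list compression can be sketched as follows, assuming that "the delta code" means the Elias delta code and representing bits as a Python string of '0' and '1' for readability. The function names are hypothetical. Since the variables in a list are strictly increasing and numbered from 1, every gap is a positive integer, which is what the delta code requires.

```python
def delta_encode(x):
    """Elias delta code of a positive integer, as a bit string."""
    assert x >= 1
    n = x.bit_length()              # number of bits of x
    ln = n.bit_length() - 1         # zeros that announce the width of n
    return "0" * ln + format(n, "b") + format(x, "b")[1:]


def delta_decode(bits, pos=0):
    """Decode one value starting at pos; return (value, next position)."""
    ln = 0
    while bits[pos] == "0":
        ln += 1
        pos += 1
    n = int(bits[pos:pos + ln + 1], 2)
    pos += ln + 1
    x = int("1" + bits[pos:pos + n - 1], 2)   # restore the implicit leading 1
    return x, pos + n - 1


def encode_list(variables):
    """Gap-encode a strictly increasing list of variable numbers."""
    out, prev = [], 0
    for v in variables:
        out.append(delta_encode(v - prev))
        prev = v
    return "".join(out)


def decode_list(bits):
    """Inverse of encode_list."""
    values, pos, prev = [], 0, 0
    while pos < len(bits):
        gap, pos = delta_decode(bits, pos)
        prev += gap
        values.append(prev)
    return values
```

Small gaps cost few bits (a gap of 1 takes a single bit), so a list of nearby variable numbers is stored far more compactly than with lgn bits per entry.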

Substring extraction
• Keep the starting position of the substring
encoded by each variable Xi in position array P
– Naïve representation : nlgN bits

• Observation : the position array is a monotonically
increasing sequence [Grossi et al., 2003]
– Space : nlg(N/n)+3n+o(n) bits
• Extraction time of a substring
of length l is O(l+h)

[Figure: position array P with increasing entries]
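The O(l+h) behaviour can be illustrated with a simplified extractor. Note the substitution: instead of the position array P, this sketch keeps the expansion length of each variable, which is enough to decide at every node whether the requested range touches the left child, the right child, or both. The rule set is the example SLP from the earlier slide; the function names are hypothetical.

```python
from functools import lru_cache

# Example rules from the SLP slide: X1 -> ab, X2 -> X1 a, X3 -> X1 X2, X4 -> X3 X2
RULES = {"X1": ("a", "b"), "X2": ("X1", "a"),
         "X3": ("X1", "X2"), "X4": ("X3", "X2")}


@lru_cache(maxsize=None)
def size(sym):
    """Length of the string derived from sym (terminals have length 1)."""
    if sym not in RULES:
        return 1
    left, right = RULES[sym]
    return size(left) + size(right)


def _walk(sym, start, length, out):
    """Append the characters at offsets [start, start+length) of sym to out."""
    if length <= 0:
        return
    if sym not in RULES:            # terminal symbol
        out.append(sym)
        return
    left, right = RULES[sym]
    nl = size(left)
    if start < nl:                  # range overlaps the left child
        _walk(left, start, min(length, nl - start), out)
    if start + length > nl:         # range overlaps the right child
        s = max(start - nl, 0)
        _walk(right, s, start + length - nl - s, out)


def extract(sym, start, length):
    """Substring of the expansion of sym, 0-based start."""
    out = []
    _walk(sym, start, length, out)
    return "".join(out)
```

Only subtrees that intersect the requested range are visited: the two boundary paths contribute O(h) nodes and the fully covered subtrees contribute O(l) nodes.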

Experiments
• Ecoli (108MB) and kernel (247MB) texts from the
repetitive collections of the Pizza & Chili corpus
• Evaluate compression time, working space and
substring extraction time
• Compare FOLCA with LZend [Kreft and
Navarro’10]
• Applicability to 100 human genomes (300GB)

Compression time and working
space for the Ecoli text
FOLCA: space for the hash table (H), dictionary (D)
and position array (P)
load factor  time (sec)  H (MB)  H+D (MB)  H+D+P (MB)
0.01         1,328       23      45        50
0.05         728         37      59        64
0.1          553         48      70        75
0.3          416         65      87        92
0.5          408         90      112       117
LZend
time (sec)  space (MB)
2,217       2,410

Compression time and working
space for the kernel text
FOLCA: space for the hash table (H), dictionary (D)
and position array (P)
load factor  time (sec)  H (MB)  H+D (MB)  H+D+P (MB)
0.01         2,891       11      21        23
0.05         2,071       13      23        25
0.1          1,472       16      26        28
0.3          951         30      40        42
0.5          882         42      52        54
LZend
time (sec)  space (MB)
4,547       4,653

Substring extraction time and
working space for the kernel text
Time [sec]
Length  FOLCA    LZend
10^1    0.00007  0.00002
10^2    0.00026  0.00011
10^3    0.00224  0.00100
10^4    0.02176  0.00954
10^5    0.21328  0.09215

Working space [MB]
FOLCA  LZend
12     14

Compression size for 100 human
genomes (300GB)

Compression time for 100 human
genomes (300GB)

Summary of FOLCA
• Directly encode an SLP into a succinct
representation of nlgn+2n+o(n) bits
• Asymptotically equivalent to the information
theoretic lower bound [CPM,2013]
• Compressed hash table for small working space
of (1+α)nlgn+n(3+lg(αn)) bits
• Support substring extraction in O(l+h) time using
additional space of nlg(N/n)+3n+o(n) bits
