PAGE1
www.exensa.com
Count-Min Tree Sketch: Approximate Counting for NLP
Guillaume Pitel, Geoffroy Fouquier, Emmanuel Marchand, Abdul Mouhamadsultane
Presenter: Guillaume Pitel, June 9, 2016
A bit of context
Why do we need to count?
Data analysis platform: eXenGine.
It processes different kinds of data (mostly text).
We need to create relevant cross-features; to do that, we need to count the occurrences of all possible cross-features. For text data, a particular kind of cross-feature is known as n-grams.
There are many different measures for deciding whether an n-gram is interesting. All of them require counting the occurrences of the cross-feature and of the features themselves (i.e. counting bigrams and the words inside bigrams).
Counting exactly is easy and distributable, but very slow because of memory usage. Holding the whole count data structure in memory is impossible, so one has to resort to huge map/reduce jobs with joins.
A bit of context
What kind of data are we talking about?
Google N-grams:
  tokens: 1,024 billion
  sentences: 95 billion
  1-grams (count > 200): 14 million
  2-grams (count > 40): 314 million
  3-grams: 977 million
  4-grams: 1.3 billion
  5-grams: 1.2 billion
A bit of context
What kind of data are we talking about ?
Zipfian distribution
[Le Quan et al. 2003]
A bit of context
What kind of measures are we talking about ?
PMI (pointwise mutual information), TF-IDF, LLR (log-likelihood ratio)
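All of these measures are computed from counts. As a concrete (hypothetical) example of the kind of formula involved, PMI of a bigram can be computed from raw counts like this; the numbers below are toy values, not from the deck:

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information from raw counts:
    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )."""
    return math.log2((count_xy / total) / ((count_x / total) * (count_y / total)))

# Hypothetical toy counts: a bigram seen 1,000 times in 1M tokens,
# with its two words seen 5,000 and 2,000 times respectively.
print(round(pmi(1000, 5000, 2000, 1_000_000), 2))  # 6.64
```

Note that only the logarithm of each count matters, which is what motivates logarithmic counters later in the deck.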
A bit of context
Summary / Goals
• Many counts: we need to store a large amount of counts.
• Logarithms in measures: we care about the order of magnitude.
• Fast and memory-controlled: we don't want distributed memory for the counts.
• Zipfian counts: many very small counts that will be filtered out later.
⇒ We can use probabilistic structures.
Count-Min Sketch
A probabilistic data structure to store counts [Cormode & Muthukrishnan 2005]
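A minimal sketch of the classic structure, for illustration only (the hash family below is an arbitrary choice, not the one from the paper):

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min Sketch: depth rows of width counters. Each item
    hashes to one cell per row; an update adds to all of its cells, and a
    query takes the minimum, which can overestimate but never
    underestimate the true count."""

    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One hash per row (an illustrative choice, not the paper's).
        for i in range(self.depth):
            h = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield i, int.from_bytes(h[:8], "big") % self.width

    def update(self, item, count=1):
        for i, j in self._cells(item):
            self.rows[i][j] += count

    def query(self, item):
        return min(self.rows[i][j] for i, j in self._cells(item))

cms = CountMinSketch()
for word in ["the", "the", "the", "cat"]:
    cms.update(word)
print(cms.query("the"), cms.query("cat"))  # 3 1 (barring collisions)
```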
Count-Min Sketch
A probabilistic data structure to store counts
Conservative update: improve CMS by updating only the minimum values.
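The conservative-update rule can be sketched in a few lines. This standalone version works on a raw rows/width/depth layout (the hash function is again an illustrative stand-in):

```python
import hashlib

def cells(item, depth, width):
    """One cell per row for `item` (illustrative hash family)."""
    for i in range(depth):
        h = hashlib.sha1(f"{i}:{item}".encode()).digest()
        yield i, int.from_bytes(h[:8], "big") % width

def conservative_add(rows, item, count=1):
    """Raise each of the item's cells only up to (current estimate +
    count); cells already at or above that bound are left untouched.
    This reduces overestimation, at the price of losing support for
    deletions."""
    cs = list(cells(item, len(rows), len(rows[0])))
    target = min(rows[i][j] for i, j in cs) + count
    for i, j in cs:
        rows[i][j] = max(rows[i][j], target)

def query(rows, item):
    return min(rows[i][j] for i, j in cells(item, len(rows), len(rows[0])))

rows = [[0] * 1024 for _ in range(4)]
for _ in range(3):
    conservative_add(rows, "sketch")
print(query(rows, "sketch"))  # 3
```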
Count-Min Log Sketch
A probabilistic data structure to store logarithmic counts
[Pitel & Fouquier, 2015]: the same idea as [Talbot, 2009], applied inside a Count-Min Sketch.
Instead of regular 32-bit counters, we use 8- or 16-bit "Morris" counters that count logarithmically.
Since the counts end up inside logarithms anyway, the error on PMI/TF-IDF/… is almost the same, but we can fit more counters.
However, a count of 1 still uses the same amount of memory as a count of 10,000. Also, at some point the error stops improving with space (there is an inherent residual error).
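A Morris-style counter stores only a small exponent and increments it probabilistically, which is what lets 8 or 16 bits cover huge counts. A minimal sketch (base 2 here; real deployments tune the base closer to 1 for accuracy):

```python
import random

class MorrisCounter:
    """Morris-style approximate counter: stores a small exponent c and
    increments it with probability base**-c, so a tiny register can
    represent counts that grow exponentially in c."""

    def __init__(self, base=2.0):
        self.base = base
        self.c = 0  # the small stored exponent

    def update(self):
        if random.random() < self.base ** -self.c:
            self.c += 1

    def estimate(self):
        # Unbiased estimator of the true count for this update rule.
        return (self.base ** self.c - 1) / (self.base - 1)

# One counter is noisy, but the estimator is unbiased: averaging many
# counters that each saw 1,000 events recovers roughly 1,000.
random.seed(42)
counters = [MorrisCounter() for _ in range(200)]
for m in counters:
    for _ in range(1000):
        m.update()
print(sum(m.estimate() for m in counters) / 200)  # roughly 1000
```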
Count-Min Tree Sketch
A Count-Min Sketch with shared counters.
Idea: use a hierarchical storage where the most significant bits are shared between counters.
Somewhat similar to TOMB counters [Van Durme, 2009], except that overflow is managed very differently.
Tree Shared Counters
Sharing most significant bits
Or: how can we store counts with an average approaching 4 bits / counter?
8-counter structure:
o A tree is made of three kinds of storage: counting bits, barrier bits, and a spire (not required, except for performance).
o Several layers alternate counting and barrier bits.
o Here we have a <[(8,8),(4,4),(2,2),(1,1)],4> counter.
[Figure: 8-counter tree structure, annotated with barrier bits, counting bits, spire, and base layer]
Tree Shared Counters
Sharing most significant bits
Or: how can we store counts with an average approaching 4 bits / counter?
8-counter structure:
o 8 counters in 30 bits + spire.
o Without a spire, n bits can count up to 3 × 2^(1 + log2(n/4)).
o Many small shared counters with spires are more efficient than one large shared counter.
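The "8 counters in 30 bits + spire" figure can be checked directly from the layer description, as a small sanity computation (the interpretation of each pair as (counting bits, barrier bits) follows the previous slide):

```python
# Sizes implied by the <[(8,8),(4,4),(2,2),(1,1)],4> structure:
# each pair is (counting bits, barrier bits) for one layer,
# plus a 4-bit spire on top.
layers = [(8, 8), (4, 4), (2, 2), (1, 1)]
spire_bits = 4

counting_bits = sum(c for c, _ in layers)        # 15
barrier_bits = sum(b for _, b in layers)         # 15
tree_bits = counting_bits + barrier_bits         # 30 bits for 8 counters
bits_per_counter = (tree_bits + spire_bits) / 8  # 4.25 including the spire

print(tree_bits, bits_per_counter)  # 30 4.25
```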
Tree Shared Counters
Reading values
o A counter stops at the first ZERO barrier bit.
o When two barrier paths meet, there is a conflict.
o The barrier length (b) is read in unary.
o The counting bits (c) are read in the classical positional way.
[Figure: reading two counters from the shared tree; example values b=2/c=110 and b=4/c=01011001, with a conflict between counters 4 and 7]
Tree Shared Counters
Incrementing (counter 5)
[Figure: incrementing counter 5, steps 0–2 (bit states of the structure at each step)]
Tree Shared Counters
Incrementing (counter 5)
[Figure: incrementing counter 5, steps 3–5]
Tree Shared Counters
Incrementing (counter 5)
[Figure: incrementing counter 5, step 6. A bit at each successive level is worth 1, 2, 2, 4, 4, 8, …]
Count-Min Tree Sketches
Experiments
Results!
• 140M tokens from English Wikipedia*
• 14.7M terms (unigrams + bigrams)
• Reference counts stored in an UnorderedMap → 815 MiB
Perfect storage size: suppose we had a perfect hash function and stored the counts in 32-bit counters. For 14.7M terms, that amounts to 59 MiB.
Performance: our implementation of a CMTS using <[(128,128),(64,64)…],32> counters matches native UnorderedMap performance.
We use 3-layer sketches (a good performance/precision tradeoff).
* We preferred to test our counters with a large number of parameter settings rather than with a large corpus, so we limit ourselves to 5% of Wikipedia.
Count-Min Tree Sketches
Average Relative Error [chart]
Count-Min Tree Sketches
RMSE [chart]
Count-Min Tree Sketches
RMSE on PMI [chart]
Count-Min Tree Sketch
Question: are CMTS really useful in real life?
1 – CMTS are better on the whole vocabulary, but what happens if we skip the least frequent words / bigrams?
2 – CMTS are better on average, but what happens quantile by quantile?
Count-Min Tree Sketches
PMI Error per quantile [chart]
(sketches at 50% of perfect size; evaluation limited to f > 10⁻⁷)
Count-Min Tree Sketches
Relative Error per log2-quantile [chart]
(sketches at 50% of perfect size; evaluation limited to f > 10⁻⁷)
Conclusion
Where are we?
CMTS significantly outperforms the other methods for storing and updating Zipfian counts.
Because most of the time spent in sketch accesses is due to memory access, its performance is on par with the other methods.
• Main drawback: at very high (and impractical anyway) pressures (less than 10% of the perfect storage size), the error skyrockets.
• Other drawback: the implementation is not straightforward. We have devised at least 4 different ways to increment the counters.
Merging (and thus distributing) is easy once you can read and set a counter.
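For a plain Count-Min Sketch, merging is just cell-wise addition of two sketches built with the same parameters; the slide's point is that the same principle lifts to CMTS once counters can be read and set. A minimal sketch of the CMS case (the rows-of-lists layout is an illustrative choice):

```python
def merge(rows_a, rows_b):
    """Merge two Count-Min Sketches that share width, depth and hash
    functions: plain cell-wise addition. For a CMTS, the same principle
    applies one level up: read each logical counter in both trees and
    set the sum in the merged tree."""
    assert len(rows_a) == len(rows_b) and len(rows_a[0]) == len(rows_b[0])
    return [[a + b for a, b in zip(ra, rb)]
            for ra, rb in zip(rows_a, rows_b)]

print(merge([[1, 2, 0]], [[0, 3, 4]]))  # [[1, 5, 4]]
```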
Conclusion
Where are we going?
Dynamic: we are working on a CMTS version that can grow automatically (more layers added below).
Pressure control: when we detect that pressure becomes too high, we can divide and subsample to keep collisions from cascading.
An open-source Python package is on its way.