HyperLogLog and friends
Simon Lia-Jonassen
FAST FHL 2020
 Counting distinct items
 Counting most frequent items
 Computing quantiles
 Computing joins
 …
 Memory-hungry
 May not parallelize well
Users visiting by day
            Thursday   Friday
Site A      7.0M       7.5M
Site B      4.6M       4.4M
[Figure: sets labeled a Thu, a Fri, b Thu, b Fri]
 Were there 7.5M (complete overlap) or 23.5M (no overlap) users in total?
 Streamable
 Sub-linear in size
 Approximate with a predictable error
 Mergeable / additive
 Highly parallelizable
https://yahooeng.tumblr.com/post/135390948446/data-sketches
 Exact detection
 Use a list, hash-set, or dictionary-size bit-set
 Requires linear space
 Parallelizing requires partitioning or a shared dictionary
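The exact approach can be sketched in a few lines. This is a minimal illustration with made-up visitor ids (`u1` to `u4` are hypothetical, not data from the slides); it shows why per-day counts cannot simply be added, and why the hash-set grows with the number of distinct keys.

```python
def exact_distinct(*partitions) -> int:
    """Exact distinct count: keep every key seen so far in one hash-set."""
    seen = set()
    for partition in partitions:
        seen.update(partition)  # memory grows linearly with distinct keys
    return len(seen)


# Hypothetical visitor ids for two days on one site.
thursday = ["u1", "u2", "u3", "u2"]
friday = ["u2", "u3", "u4"]

daily_counts = exact_distinct(thursday) + exact_distinct(friday)  # 3 + 3
total = exact_distinct(thursday, friday)  # the days overlap on u2 and u3
print(daily_counts, total)
```

Adding the daily counts gives 6, while the true number of distinct visitors is 4. Getting the true number required holding all keys from both days in one place.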
 Bloom filter
 Use an m-bit vector (sub-linear) and k mutually independent hash functions
 Update: set k bits. Query: check whether all k bits are set
 False-positive probability is known
 Can merge multiple filters
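The Bloom filter operations listed above could look roughly like this. It is a minimal sketch, not a production implementation: the class name and methods are illustrative, and the k hash functions are simulated by salting SHA-256 with an index, which is an assumption of this example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit vector and k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = 0  # a Python int used as an m-bit vector

    def _positions(self, item: str):
        # Derive k bit positions by salting one strong hash with the index i.
        # This stands in for k independent hash functions.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Update: set the k bits selected by the hash functions.
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: str) -> bool:
        # Query: the item may be present only if all k bits are set.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def merge(self, other: "BloomFilter") -> "BloomFilter":
        # Filters with identical m and k merge with a bitwise OR.
        if (self.m, self.k) != (other.m, other.k):
            raise ValueError("filters must share m and k")
        merged = BloomFilter(self.m, self.k)
        merged.bits = self.bits | other.bits
        return merged
```

A query never returns a false negative: every added item sets its own bits. False positives occur when other items happen to set all k bits of an item that was never added.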
 Linear Counting
 We can hash each item into a bitmap of b < n bits (assuming we know n)
 Update – set the item's bit to 1
 Query – estimate from the number of unset bits
http://dblab.kaist.ac.kr/Prof/pdf/ACM90_TODS_v15n2.pdf
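Linear Counting as described above can be sketched as follows. The estimate used here is n ≈ -b · ln(V), where V is the fraction of bits still unset. The function name and the use of SHA-256 are assumptions of this example, and one byte per bit is used only to keep the code short.

```python
import hashlib
import math


def linear_count(items, b: int) -> float:
    """Estimate the number of distinct items using a b-bit bitmap."""
    bitmap = bytearray(b)  # one byte per bit keeps the sketch easy to read
    for item in items:
        digest = hashlib.sha256(str(item).encode()).digest()
        bitmap[int.from_bytes(digest[:8], "big") % b] = 1  # update: set the bit
    unset = b - sum(bitmap)
    if unset == 0:
        raise ValueError("bitmap is full; choose a larger b")
    # Query: n is approximately -b * ln(fraction of unset bits).
    return -b * math.log(unset / b)
```

The bitmap can be smaller than the number of distinct items, but it must not fill up completely, which is why b has to be chosen with n in mind.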
 Let’s have a look at hash(x) using just 4 bits
Share of hash values with at least k leading 0’s:
 1 – 50% (1 in 2)
 2 – 25% (1 in 4)
 3 – 12.5% (1 in 8)
 4 – 6.25% (1 in 16)
 We expect to need about 8 random distinct elements before one has 3 or more leading 0’s.
 So if the maximum number of leading 0’s seen is k, we estimate about 2^k distinct elements.
 What if we hit 0000 early?
0000 0100 1000 1100
0001 0101 1001 1101
0010 0110 1010 1110
0011 0111 1011 1111
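The leading-zero argument above can be checked by enumerating all 4-bit values, and then turned into the crude single-counter estimator. This is a sketch under assumptions: `hash_bits` is a hypothetical helper that takes the top bits of a SHA-256 digest, and the function names are illustrative.

```python
import hashlib


def leading_zeros(value: int, bits: int) -> int:
    """Number of leading 0 bits of value viewed as a bits-wide word."""
    return bits - value.bit_length()


def hash_bits(item, bits: int) -> int:
    """Hypothetical helper: the top `bits` bits (at most 64) of a SHA-256 digest."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") >> (64 - bits)


def share_with_at_least(k: int, bits: int = 4) -> float:
    """Fraction of all bits-wide values that have at least k leading 0's."""
    values = range(1 << bits)
    return sum(leading_zeros(v, bits) >= k for v in values) / (1 << bits)


def crude_estimate(items, bits: int = 32) -> int:
    """Estimate the distinct count as 2^k, with k the max leading 0's seen."""
    k = max(leading_zeros(hash_bits(x, bits), bits) for x in items)
    return 2 ** k
```

Duplicates do not change the result, since the same item always hashes to the same value. The estimate is always a power of two and a single unlucky hash such as 0000 can inflate it badly, which motivates the bucketing that follows.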
 What if we hit 0000 early?
 We could use many independent hash functions.
 LogLog
 Use m different buckets
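 log2 m is the number of hash bits used to select the bucket
 log log H is the maximum number of bits per counter
 Estimate using 2^k_avg, where k_avg is the average of the per-bucket maxima, scaled by m and a bias-correction constant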
 Std error is 1.30/sqrt(m)
https://blog.devartis.com/hyperloglogs-a-probabilistic-way-of-obtaining-a-sets-cardinality-3b5e6a982a12
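A rough sketch of the LogLog estimator described above, assuming a 64-bit hash taken from SHA-256. Following the LogLog paper, each counter stores the position of the first 1-bit (leading 0's plus one) and the estimate is 0.39701 · m · 2^k_avg; the function names and defaults are illustrative.

```python
import hashlib

ALPHA_LOGLOG = 0.39701  # asymptotic bias-correction constant from the LogLog paper


def _hash64(item) -> int:
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def loglog_estimate(items, log_m: int = 10) -> float:
    """Estimate the distinct count with m = 2^log_m small counters."""
    m = 1 << log_m
    rest_bits = 64 - log_m
    registers = [0] * m
    for item in items:
        h = _hash64(item)
        bucket = h >> rest_bits              # first log_m bits pick the bucket
        rest = h & ((1 << rest_bits) - 1)    # remaining bits feed the counter
        rank = rest_bits - rest.bit_length() + 1  # leading 0's + 1
        registers[bucket] = max(registers[bucket], rank)
    k_avg = sum(registers) / m  # arithmetic mean of the per-bucket maxima
    return ALPHA_LOGLOG * m * 2 ** k_avg
```

With m = 1024 the expected standard error is about 1.30/sqrt(1024), roughly 4%. The estimator is only meant for cardinalities well above m.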
 HyperLogLog
 Uses a harmonic mean to combine the buckets, etc.
 Resulting std error is 1.04/sqrt(m)
 Requires about 64% of the memory LogLog needs for the same error, since (1.04/1.30)^2 = 0.64
 Variants:
 HyperLogLog++ (Google)
 Improves memory usage and estimation accuracy for small cardinalities
 Java-HLL
 Uses different representations for empty, explicit, sparse and full estimator sets.
Live Demo: http://content.research.neustar.biz/blog/hll.html
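The basic HyperLogLog estimator mentioned above can be sketched as below. This is a simplified illustration, not the HyperLogLog++ or Java-HLL implementation: it uses dense registers only, a SHA-256 based 64-bit hash (an assumption), the bias constant 0.7213/(1 + 1.079/m) from the HyperLogLog paper, and the paper's small-range fallback to Linear Counting. `relative_memory` is a hypothetical helper that spells out the memory comparison with LogLog.

```python
import hashlib
import math


class HyperLogLog:
    """Dense HyperLogLog with m = 2^log_m registers (log_m >= 7 assumed)."""

    def __init__(self, log_m: int = 10) -> None:
        self.log_m = log_m
        self.m = 1 << log_m
        self.registers = [0] * self.m

    def add(self, item) -> None:
        digest = hashlib.sha256(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        rest_bits = 64 - self.log_m
        bucket = h >> rest_bits                   # first log_m bits pick the bucket
        rest = h & ((1 << rest_bits) - 1)
        rank = rest_bits - rest.bit_length() + 1  # leading 0's + 1
        if rank > self.registers[bucket]:
            self.registers[bucket] = rank

    def estimate(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)  # bias constant, valid for m >= 128
        # Harmonic mean of 2^register, scaled by alpha * m.
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to Linear Counting.
            return m * math.log(m / zeros)
        return raw


def relative_memory(c_new: float = 1.04, c_old: float = 1.30) -> float:
    """Registers needed at equal std error, relative to the old estimator."""
    # std error = c / sqrt(m), so m scales with c^2 at a fixed error.
    return (c_new / c_old) ** 2
```

At m = 1024 the expected standard error is about 1.04/sqrt(1024), roughly 3.3%. `relative_memory()` evaluates to 0.64: at the same error HyperLogLog needs about 64% of the registers LogLog does.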
 Unions:
 Can merge any number of estimators.
 Intersections:
 Inclusion-exclusion principle.
 The accuracy is tricky.
https://cloud.google.com/blog/products/data-analytics/using-hll-speed-count-distinct-massive-datasets
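Unions and intersections can be sketched directly on the register arrays. This example assumes dense HyperLogLog registers built with a SHA-256 based hash; `build`, `estimate`, `union` and `intersection_estimate` are illustrative names, not an API of any of the libraries mentioned in these slides.

```python
import hashlib
import math


def build(items, log_m: int = 12) -> list:
    """Build HyperLogLog registers (max rank per bucket) for items."""
    m, rest_bits = 1 << log_m, 64 - log_m
    regs = [0] * m
    for item in items:
        digest = hashlib.sha256(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        rest = h & ((1 << rest_bits) - 1)
        rank = rest_bits - rest.bit_length() + 1  # leading 0's + 1
        bucket = h >> rest_bits
        regs[bucket] = max(regs[bucket], rank)
    return regs


def estimate(regs) -> float:
    """HyperLogLog estimate with the small-range Linear Counting fallback."""
    m = len(regs)
    raw = 0.7213 / (1 + 1.079 / m) * m * m / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    return m * math.log(m / zeros) if raw <= 2.5 * m and zeros else raw


def union(*sketches) -> list:
    """Merge any number of same-sized sketches: register-wise maximum."""
    return [max(column) for column in zip(*sketches)]


def intersection_estimate(a, b) -> float:
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return estimate(a) + estimate(b) - estimate(union(a, b))
```

The union is lossless: merging two sketches gives exactly the registers that a single sketch over all items would have. The intersection is a difference of three estimates, so its absolute error is on the scale of the union's error, which can swamp a small intersection.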
 A brief comparison between StreamLib (S-) and Java-HLL (J-HLL) methods
 See http://s-j.github.io/hyperloglog/ (Feb 2014) for more numbers and details
 3 765 844 tokens
 2 074 012 unque keys - Sets.newHashSet(): 1195 ms
 (S- parameters were picked for 1% error with 10 mil keys)
method % error size time
S-LinearCounting 0.17 137 073 B 1 217 ms
S-LogLog (logm=14) 1.35 16 384 B 963 ms
S-HLL (logm=13) 1.81 5 472 B 1 000 ms
S-HLL++ (logm=13) -0.81 5 473 B 863 ms
J-HLL (logm=12 regw=5 Full Auto) -0.76 2 563 B 500 ms
J-HLL (logm=10 regw=5 Sparse Auto) -2.27 643 B 570ms
 Applications:
 Stream processing
 Distributed processing
 Batch processing
 Frameworks:
 Postgres, Hadoop, Presto, Redis, Druid …
 Kusto – dcount, dcount_hll, …
 Griffin – CardinalityEstimation
 An interesting open question:
 What about user retention?
https://clevertap.com/blog/cohort-analysis-user-retention/
https://yahooeng.tumblr.com/post/135390948446/data-sketches
THANK YOU!
Code:
 https://github.com/microsoft/CardinalityEstimation
 https://github.com/aggregateknowledge/java-hll
 https://github.com/addthis/stream-lib
 https://datasketches.apache.org/
Blog posts:
 http://s-j.github.io/hyperloglog/
 https://blog.devartis.com/hyperloglogs-a-probabilistic-way-of-obtaining-a-sets-cardinality-3b5e6a982a12
 https://medium.com/@vinodhinic/hyperloglog-probabilistic-algorithm-330ecbbc686c
 https://odino.org/my-favorite-data-structure-hyperloglog/
Papers:
 LinCnt: http://dblab.kaist.ac.kr/Prof/pdf/ACM90_TODS_v15n2.pdf
 LogLog: https://link.springer.com/chapter/10.1007/978-3-540-39658-1_55
 HLL: http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
 HLL++: https://stefanheule.com/papers/edbt13-hyperloglog.pdf


Editor's Notes

  • #3: * Neither scale nor parallelize well
  • #10: Every hash has the same probability of occurring.
  • #11–#13: Log log is the number of bits needed to compute