HyperLogLog
Samuel Ni
the problem to address
compute cardinality of a multiset
count the distinct elements in a data set that contains duplicates
e.g. there are 3 distinct elements in [a, b, a, c]
solution 1
len(set(a_multi_set))
cons: runs out of memory on a big data set
solution 2
sorted_data_on_disk = external_sort(a_multi_set)
count(sorted_data_on_disk)
cons: slow
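
The counting step of solution 2 can be sketched as below. This is only an illustration: the sorted data is an in-memory list standing in for the on-disk file, and count_distinct_sorted is a hypothetical helper name, not something from the deck.

```python
from itertools import groupby

def count_distinct_sorted(sorted_items):
    """Count distinct values in an already-sorted iterable.

    Equal values are adjacent after sorting, so one pass with
    constant extra memory is enough.
    """
    # groupby collapses each run of equal neighbours into one group
    return sum(1 for _key, _run in groupby(sorted_items))

# In-memory stand-in for the externally sorted data
sorted_data = sorted(["a", "b", "a", "c"])
print(count_distinct_sorted(sorted_data))  # prints 3
```

The pass itself is cheap; the cost is in sorting the whole data set first.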
the problem to address
estimate the cardinality of a very big multiset
Demo
How does it work?
Most HyperLogLog explanations on the web
Some observations for evenly distributed numbers
estimate cardinality using the min value
cardinality ≈ max / min
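
A rough sketch of the min-value idea, assuming the numbers are uniform in (0, upper_bound). The function name, the upper_bound parameter and the seeded random data are illustrative assumptions, not part of the deck.

```python
import random

def estimate_by_min(values, upper_bound=1.0):
    """Estimate how many distinct values were drawn uniformly from (0, upper_bound).

    The more distinct values there are, the closer the smallest one
    tends to sit to zero, so upper_bound / min grows with the cardinality.
    """
    return upper_bound / min(values)

rng = random.Random(42)                      # fixed seed, for repeatability only
distinct = [rng.random() for _ in range(1000)]
with_duplicates = distinct * 5               # same values, repeated

# Duplicates never change the minimum, so both estimates are identical
print(estimate_by_min(distinct), estimate_by_min(with_duplicates))
```

Only one number has to be remembered, but a single unusually small value throws the estimate far off, which is why this is only a starting point.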
estimate using the largest number of leading zeros seen in any number
cardinality ≈ 2^k
(where k is the biggest number of leading zeros found in a number)
e.g. 2^32 => log2(2^32) = max 32 leading zeros => log2(32) = 5-bit counter
LogLog(2^32)
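
A minimal sketch of the leading-zeros estimate for evenly distributed 32-bit integers. The helper names and the fixed bits=32 width are assumptions made for illustration.

```python
def leading_zeros(value, bits=32):
    """Leading zero bits of value when written as a bits-wide unsigned integer."""
    return bits - value.bit_length()

def estimate_by_leading_zeros(values, bits=32):
    """cardinality ≈ 2^k, where k is the most leading zeros seen in any value."""
    k = max(leading_zeros(v, bits) for v in values)
    return 2 ** k

# 2**29 has 2 leading zeros in 32 bits and 2**31 has none, so k = 2
print(estimate_by_leading_zeros([2**31, 2**29, 2**29]))  # prints 4
```

Only k is stored, and k never exceeds the bit width, which is why a handful of bits is enough for the counter.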
But what if our data set isn't evenly distributed integers?
hash functions
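
One way to turn arbitrary items into evenly spread integers is sketched below. The choice of SHA-1 truncated to 32 bits and the name hash32 are assumptions for illustration; the deck does not name a hash function.

```python
import hashlib

def hash32(item):
    """Map any item to a 32-bit integer that behaves like a uniform random number."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    # Keep the first 4 bytes of the digest as an unsigned 32-bit integer
    return int.from_bytes(digest[:4], "big")

# Equal items always get the same hash, so duplicates collapse to one value
print(hash32("a") == hash32("a"))
print(hash32("a"), hash32("b"))
```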
high error rate?
divide the data into subsets
stochastic averaging
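
The subset idea can be sketched as follows, under several assumptions: a 32-bit hash taken from SHA-1, the first b bits of the hash choosing the subset, and each subset storing a rank (leading zeros of the remaining bits, plus one, so that 0 can mean "empty"). All names are hypothetical.

```python
import hashlib

def hash32(item):
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def build_registers(items, b=4):
    """Split items into m = 2**b subsets and keep, per subset, the largest rank seen."""
    m = 1 << b
    rest_bits = 32 - b
    registers = [0] * m
    for item in items:
        h = hash32(item)
        bucket = h >> rest_bits                    # first b bits pick the subset
        rest = h & ((1 << rest_bits) - 1)          # remaining bits are examined
        rank = rest_bits - rest.bit_length() + 1   # leading zeros + 1
        registers[bucket] = max(registers[bucket], rank)
    return registers

def uncorrected_average_estimate(registers):
    """Average the per-subset ranks, then scale by the number of subsets.

    This raw figure is biased; real algorithms multiply it by a
    correction constant.
    """
    m = len(registers)
    mean_rank = sum(registers) / m
    return m * 2 ** mean_rank

registers = build_registers(["a", "b", "a", "c"])
print(registers)
```

Because each item lands in exactly one subset, one pass and one hash per item give m independent small estimates, and averaging them damps the effect of any single unlucky value.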
a couple more technicalities
correct the estimate if it is below a certain threshold or if it is very large
use the harmonic mean instead of the geometric mean
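
These two technicalities can be sketched as below. The constants and thresholds follow the HyperLogLog paper listed in the references (FlFuGaMe07); the function names, the list-of-ranks input and the extra guard on the large-range branch are assumptions of this sketch.

```python
import math

def alpha(m):
    """Bias-correction constant for m subsets (values from the HyperLogLog paper)."""
    if m == 16:
        return 0.673
    if m == 32:
        return 0.697
    if m == 64:
        return 0.709
    return 0.7213 / (1 + 1.079 / m)   # intended for m >= 128

def hll_estimate(registers, hash_bits=32):
    """Estimate cardinality from per-subset ranks (0 means the subset is empty)."""
    m = len(registers)
    # Harmonic mean of the per-subset estimates 2**rank, scaled by alpha * m
    raw = alpha(m) * m * m / sum(2.0 ** -r for r in registers)
    if raw <= 2.5 * m:
        # Small-range correction: count the empty subsets instead
        empty = registers.count(0)
        return m * math.log(m / empty) if empty else raw
    space = 2.0 ** hash_bits
    if space / 30 < raw < space:
        # Large-range correction for hash collisions; the upper guard is an
        # assumption added here so the logarithm stays defined
        return -space * math.log(1 - raw / space)
    return raw

print(hll_estimate([0] * 15 + [1]))   # one item seen: estimate close to 1
print(hll_estimate([10] * 16))
```

The harmonic mean gives less weight to a few unusually large registers than the geometric mean does, which is the change that separates HyperLogLog from LogLog.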
References
• https://github.com/sergeio/hyperloglog/blob/master/README.md
• http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinality-Estimation
• http://antirez.com/news/75
• https://www.periscopedata.com/blog/hyperloglog-in-pure-sql.html
• https://stackoverflow.com/questions/12327004/how-does-the-hyperloglog-algorithm-work
• http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
• http://opensourceconnections.com/blog/2015/02/04/its-log-its-log-its-big-its-hyper-its-good/

Editor's Notes

  • #19, #20: If we count the maximum number of trailing zeroes in the hash of each value (*gasp*) for each subset, and average them together, we can get much closer. This is "stochastic averaging".