This document discusses probabilistic data structures useful for analyzing big data streams. It covers techniques such as sampling, Bloom filters, cuckoo filters, Count-Min sketch, t-digest, and HyperLogLog, which estimate statistics of large datasets in a memory- and computation-efficient manner. These probabilistic structures trade exact answers for performance, estimating quantities such as frequencies, quantiles, set membership, and cardinality in sub-linear time and space. Real-world applications in domains such as analytics, anomaly detection, and distributed systems are also presented.
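As a minimal sketch of the exactness-for-space trade-off these structures make, consider a toy Bloom filter for set membership. The class name, bit-array size, and hash count below are illustrative choices, not taken from the document:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: membership queries may yield false positives,
    but never false negatives, using far less space than storing the set."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice")
print(bf.might_contain("alice"))  # True
print(bf.might_contain("bob"))    # almost certainly False
```

The filter answers "possibly in the set" or "definitely not in the set" using a fixed number of bits regardless of item size, which is the characteristic trade these structures make.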
Related topics: