Mining High-Speed Data Streams
Davide Gallitelli
Politecnico di Torino – TELECOM ParisTech
@DGallitelli95
Mining High-Speed Data Streams 1
Pedro Domingos
University of Washington
Geoff Hulten
University of Washington
1. Introduction 2
Huge and fast data streams
1. Introduction 3
KDD systems operating continuously and indefinitely
Limited by:
• Time
• Memory
• Sample Size
SPRINT: tested on up to a few million examples.
Less than a day's worth!
1. Introduction 4
VERY FAST DECISION TREE
Hoeffding Decision Tree
2. Hoeffding Trees 5
2. Hoeffding Trees 6
• Classical DT learners are limited by main memory size
• Probably, not all examples are needed to find the best attribute at a node
• How to decide how many are necessary? The Hoeffding bound!
«Suppose we have made n independent observations of a variable r with range R, and computed their mean r̄. The Hoeffding bound states that, with probability 1 − δ, the true mean of the variable is at least r̄ − ε.»
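For reference, the bound on ε (the standard Hoeffding bound used in the paper), as a function of the number of observations n, the range R, and the confidence parameter δ:

```latex
\epsilon = \sqrt{\frac{R^{2}\,\ln(1/\delta)}{2n}}
```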
2. Hoeffding Trees 7
How many examples are enough?
• Let G(X_i) be the heuristic measure of choice (Information Gain, Gini Index)
• X_a: the attribute with the highest split evaluation value after n examples
• X_b: the attribute with the second-highest split evaluation value after n examples
• We can compute ΔḠ = Ḡ(X_a) − Ḡ(X_b) and check whether ΔḠ > ε
• Thanks to the Hoeffding bound, we can infer that:
• ΔG ≥ ΔḠ − ε > 0 with probability 1 − δ, where ΔG is the true difference in the heuristic measure
• This means that we can split the node using X_a, and the succeeding examples will be passed to the new leaves (incremental approach)
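A minimal sketch of this check in Python (illustrative only; the function and parameter names are my own, not from the paper):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the true mean lies within epsilon of the observed
    mean of n observations, with probability 1 - delta."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(g_best, g_second, n, delta=1e-7, value_range=1.0):
    """Split once the observed gain gap exceeds the Hoeffding bound.
    value_range = 1.0 covers Gini and two-class information gain;
    use log2(num_classes) for information gain with more classes."""
    eps = hoeffding_bound(value_range, delta, n)
    return (g_best - g_second) > eps
```

For example, with delta = 1e-7 and an observed gap of 0.05, should_split starts returning True after roughly 3,200 examples at the leaf.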
2. Hoeffding Trees 8
• Compute the heuristic measure for the candidate attributes and determine the best two
• At each node check the condition ΔḠ = Ḡ(X_a) − Ḡ(X_b) > ε
• If true, create child nodes based on the test at the node; else, get more examples from the stream.
HT Algorithm
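A compact sketch of this per-example loop, reusing should_split from the previous slide; Leaf, its statistics, and the gain computation are hypothetical placeholders rather than code from the paper:

```python
def hoeffding_tree_learn(stream, delta=1e-7):
    root = Leaf()                           # hypothetical leaf class holding per-attribute class counts
    for x, y in stream:                     # each example is read at most once
        leaf = root.sort(x)                 # route the example down the current tree to a leaf
        leaf.update_statistics(x, y)        # update the leaf's sufficient statistics
        gains = leaf.evaluate_attributes()  # heuristic G for every candidate attribute
        if len(gains) >= 2:
            g_best, g_second = sorted(gains.values(), reverse=True)[:2]
            if should_split(g_best, g_second, leaf.n_examples, delta):
                leaf.split_on(max(gains, key=gains.get))  # turn the leaf into an internal node
    return root
```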
2. Hoeffding Trees 9
In a nutshell
• Learning in a Hoeffding tree takes constant time per example (instance), which makes it suitable for data stream mining.
• Requires each example to be read at most once (incrementally built).
• With high probability, a Hoeffding tree is asymptotically identical to the
decision tree built by a batch learner.
E[Δ_i(HT_δ, DT*)] ≤ δ / p, where DT* is the asymptotic batch tree and p is the leaf probability
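As an order-of-magnitude illustration (the leaf probability p = 10⁻² is an assumed value, not a figure from the paper):

```latex
E[\Delta_i(HT_\delta, DT^*)] \le \frac{\delta}{p} = \frac{10^{-7}}{10^{-2}} = 10^{-5}
% i.e., the Hoeffding tree is expected to disagree with the asymptotic
% batch tree on at most about 0.001% of examples.
```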
• Independent of the probability distribution generating the observations
• Built incrementally by sequential reading
• Makes class predictions in parallel with learning (anytime model)
Open issues (addressed by VFDT):
• What happens with ties?
• Memory used as the tree expands
• Number of candidate attributes
goo.gl/gBnm9h
goo.gl/QvZMC7
VFDT
3. VFDT System 10
3. VFDT System 11
VFDT (Very Fast Decision Tree)
• VFDT is an implementation of the Hoeffding tree algorithm
• VFDT includes refinements to the HT algorithm:
• Tie-breaking between near-equivalent attributes (sketched below)
• Recompute G only after a user-defined number of new examples (n_min)
• Deactivation of the least promising leaves (to bound memory)
• Early drop of unpromising attributes (when their ΔḠ from the best exceeds ε)
• Bootstrap with a traditional learner on a small subset of the data
• Rescan of previously-seen examples (if data arrives slowly enough)
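A minimal sketch of the tie-breaking refinement, building on the should_split helper above; tau corresponds to the τ = 5% threshold on the next slide, and the names are again illustrative:

```python
def should_split_with_ties(g_best, g_second, n, delta=1e-7, tau=0.05, value_range=1.0):
    """VFDT-style check: split when the gap beats the Hoeffding bound,
    or when the bound itself is so tight that the tie no longer matters."""
    eps = hoeffding_bound(value_range, delta, n)
    return (g_best - g_second) > eps or eps < tau
```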
3. VFDT System 12
Comparison with C4.5
Parameters: δ = 10⁻⁷ (confidence), τ = 5% (tie-breaking threshold), n_min = 200 (new examples between recomputations of G)
4. Application 13
A VFDT application: Web data
• Mining the stream of Web page requests emanating from the whole University of Washington main campus.
• Useful to improve Web caching by predicting which hosts and pages will be requested in the near future.
5. Conclusion 14
Future Work
• Test other applications (such as Intrusion detection)
• Use of non-discretized numeric attributes
• Use of post-pruning
• Use of adaptive δ
• Compare with other incremental algorithms (ID5R or SLIQ/SPRINT)
• Adapt to time-changing domains (concept drift)
• Parallelization
5. Conclusion 15
QUESTIONS?
5. Conclusion 16
THANK YOU!
Editor's Notes
• #3: Let's think about two situations. On the left, the smart city of the future, with thousands of sensors and control systems. On the right, present-day banking systems, which generate millions of transactions per day and are expected to grow even more as e-shopping continues to spread. Thinking about the data produced by those systems, what are its main characteristics? < change > Size and speed. No more standard big data analytics, but high-speed data stream mining.
  • #4: Knowledge discovery systems are constrained by three main limited resources: time, memory and sample size. In traditional applications of machine learning and statistics, sample size tends to be the dominant limitation. In contrast, in many (if not most) present-day data mining applications, the bottleneck is time and memory, not examples. The latter are typically in over-supply, in the sense that it is impossible with current KDD systems to make use of all of them within the available computational resources. Currently, the most efficient algorithms available (e.g., SPRINT or BIRCH) concentrate on making it possible to mine databases that do not fit in main memory by only requiring sequential scans of the disk. But even these algorithms have only been tested on up to a few million examples. Ideally, we would like to have KDD systems that operate continuously and indefinitely, incorporating examples as they arrive, and never losing potentially valuable information. Incremental algorithms are out there, but they are either highly sensitive to example ordering, potentially never recovering from an unfavorable set of early examples, or produce results similar to batch classification with undesired overhead in computation time.
• #5: Introducing: VFDT, a decision-tree learning system that overcomes the shortcomings of incremental algorithms. It is I/O bound, which means it mines examples in less time than it takes to input them from disk; it is an anytime algorithm, meaning the model is ready to use at any time; and it does not store any examples, learning by seeing each of them exactly once.
  • #7: Hoeffding Trees are born from the limitations of classical decision tree learners, which assume all training data can be simultaneously stored in main memory. HT is based on the assumption that, in order to find the best attribute at a node, it may be sufficient to consider only a small subset of the training examples that pass through that node. Given a stream of examples, the first ones will be used to choose the root test; once the root attribute is chosen, the succeeding examples will be passed down to the corresponding leaves and used to choose the appropriate attributes there, and so on recursively. We solve the difficult problem of deciding exactly how many examples are necessary at each node by using a statistical result known as the Hoeffding bound.
  • #8: So, how do we decide how many examples are enough?
  • #10: If HTδ is the tree produced by the Hoeffding tree algorithm with desired probability δ given infinite examples (Table 1), DT∗ is the asymptotic batch tree, and p is the leaf probability, then E[∆i(HTδ, DT∗)] ≤ δ/p. The smaller δ/p , the more similar the Hoeffding tree is to a subtree of the asymptotic batch tree.
• #12: The Hoeffding tree algorithm was implemented into the Very Fast Decision Tree learner (VFDT), which includes some enhancements for practical use. In case of ties, potentially many examples would be required to decide between the tied attributes with some confidence, which is wasteful since they are essentially equivalent; VFDT simply splits on the current best attribute. Recomputing G is actually pretty expensive, so VFDT lets the user define a parameter for the minimum number of examples read before recomputing G. Memory was an issue for HT: the more the tree grew, the more memory it needed. VFDT deactivates the least promising leaves, keeping track only of the probability of x falling into leaf l, times the observed error rate.