Mining High-Speed Data Streams
Davide Gallitelli
Politecnico di Torino – TELECOM ParisTech
@DGallitelli95
Mining High-Speed Data Streams 1
Pedro Domingos
University of Washington
Geoff Hulten
University of Washington
1. Introduction 2
Huge and fast data streams
1. Introduction 3
KDD systems operating continuously and indefinitely
Limited by:
• Time
• Memory
• Sample Size
SPRINT: tested on up to a few million examples.
Less than a day's worth!
1. Introduction 4
VERY FAST DECISION TREE
Hoeffding Decision Tree
2. Hoeffding Trees 5
2. Hoeffding Trees 6
• Classical DT learners are limited by main memory size
• Probably, not all examples are needed to find the best attribute at a node
• How to decide how many are necessary? The Hoeffding bound!
«Suppose we have made n independent observations of a variable r with range R, and computed their mean r̄. The Hoeffding bound states that, with probability 1 − δ, the true mean of the variable is at least r̄ − ε.»
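For reference, the bound on ε (the standard Hoeffding bound used in the paper), as a function of the number of observations n, the range R, and the confidence parameter δ:

```latex
\epsilon = \sqrt{\frac{R^{2}\,\ln(1/\delta)}{2n}}
```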
2. Hoeffding Trees 7
How many examples are enough?
• Let G(X_i) be the heuristic measure of choice (Information Gain, Gini Index)
• X_a: the attribute with the highest split evaluation value after n examples
• X_b: the attribute with the second-highest split evaluation value after n examples
• We can compute ΔḠ = Ḡ(X_a) − Ḡ(X_b) and check whether ΔḠ > ε
• Thanks to the Hoeffding bound, we can infer that:
• ΔG ≥ ΔḠ − ε > 0 with probability 1 − δ, where ΔG is the true difference in the heuristic measure
• This means that we can split the node using X_a, and the succeeding examples will be passed to the new leaves (incremental approach)
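A minimal sketch of this check in Python (illustrative only; the function and parameter names are my own, not from the paper):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the true mean lies within epsilon of the observed
    mean of n observations, with probability 1 - delta."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(g_best, g_second, n, delta=1e-7, value_range=1.0):
    """Split once the observed gain gap exceeds the Hoeffding bound.
    value_range = 1.0 covers Gini and two-class information gain;
    use log2(num_classes) for information gain with more classes."""
    eps = hoeffding_bound(value_range, delta, n)
    return (g_best - g_second) > eps
```

For example, with delta = 1e-7 and an observed gap of 0.05, should_split starts returning True after roughly 3,200 examples at the leaf.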
2. Hoeffding Trees 8
• Compute the heuristic measure for the candidate attributes and determine the best two
• At each node check the condition ΔḠ = Ḡ(X_a) − Ḡ(X_b) > ε
• If true, create child nodes based on the test at the node; else, get more examples from the stream.
HT Algorithm
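A compact sketch of this per-example loop, reusing should_split from the previous slide; Leaf, its statistics, and the gain computation are hypothetical placeholders rather than code from the paper:

```python
def hoeffding_tree_learn(stream, delta=1e-7):
    root = Leaf()                           # hypothetical leaf class holding per-attribute class counts
    for x, y in stream:                     # each example is read at most once
        leaf = root.sort(x)                 # route the example down the current tree to a leaf
        leaf.update_statistics(x, y)        # update the leaf's sufficient statistics
        gains = leaf.evaluate_attributes()  # heuristic G for every candidate attribute
        if len(gains) >= 2:
            g_best, g_second = sorted(gains.values(), reverse=True)[:2]
            if should_split(g_best, g_second, leaf.n_examples, delta):
                leaf.split_on(max(gains, key=gains.get))  # turn the leaf into an internal node
    return root
```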
2. Hoeffding Trees 9
In a nutshell
• Learning in a Hoeffding tree takes constant time per example (instance), which makes it suitable for data stream mining.
• Requires each example to be read at most once (incrementally built).
• With high probability, a Hoeffding tree is asymptotically identical to the
decision tree built by a batch learner.
E[Δ_i(HT_δ, DT*)] ≤ δ / p, where DT* is the asymptotic batch tree and p is the leaf probability
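As an order-of-magnitude illustration (the leaf probability p = 10⁻² is an assumed value, not a figure from the paper):

```latex
E[\Delta_i(HT_\delta, DT^*)] \le \frac{\delta}{p} = \frac{10^{-7}}{10^{-2}} = 10^{-5}
% i.e., the Hoeffding tree is expected to disagree with the asymptotic
% batch tree on at most about 0.001% of examples.
```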
• Independent of the probability distribution generating the observations
• Built incrementally by sequential reading
• Makes class predictions in parallel with learning (anytime model)
Open issues (addressed by VFDT):
• What happens with ties?
• Memory used as the tree expands
• Number of candidate attributes
goo.gl/gBnm9h
goo.gl/QvZMC7
VFDT
3. VFDT System 10
3. VFDT System 11
VFDT (Very Fast Decision Tree)
• VFDT is an implementation of the Hoeffding tree algorithm
• VFDT includes refinements to the HT algorithm:
• Tie-breaking between near-equivalent attributes (sketched below)
• Recompute G only after a user-defined number of new examples (n_min)
• Deactivation of the least promising leaves (to bound memory)
• Early drop of unpromising attributes (when their ΔḠ from the best exceeds ε)
• Bootstrap with a traditional learner on a small subset of the data
• Rescan of previously-seen examples (if data arrives slowly enough)
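A minimal sketch of the tie-breaking refinement, building on the should_split helper above; tau corresponds to the τ = 5% threshold on the next slide, and the names are again illustrative:

```python
def should_split_with_ties(g_best, g_second, n, delta=1e-7, tau=0.05, value_range=1.0):
    """VFDT-style check: split when the gap beats the Hoeffding bound,
    or when the bound itself is so tight that the tie no longer matters."""
    eps = hoeffding_bound(value_range, delta, n)
    return (g_best - g_second) > eps or eps < tau
```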
3. VFDT System 12
Comparison with C4.5
Parameters: δ = 10⁻⁷ (confidence), τ = 5% (tie-breaking threshold), n_min = 200 (new examples between recomputations of G)
4. Application 13
A VFDT application: Web data
• Mining the stream of Web page requests emanating from the whole University of Washington main campus.
• Useful to improve Web caching by predicting which hosts and pages will be requested in the near future.
5. Conclusion 14
Future Work
• Test other applications (such as Intrusion detection)
• Use of non-discretized numeric attributes
• Use of post-pruning
• Use of adaptive δ
• Compare with other incremental algorithms (ID5R or SLIQ/SPRINT)
• Adapt to time-changing domains (concept drift)
• Parallelization
5. Conclusion 15
QUESTIONS?
5. Conclusion 16
THANK YOU!
Editor's Notes
• #3: Let's think about two situations. On the left, the smart city of the future, with thousands of sensors and control systems. On the right, present-day banking systems, which generate millions of transactions per day and are expected to grow even more as e-shopping continues to spread. Thinking about the data produced by those systems, what are its main characteristics? < change > Size and speed. No more standard big data analytics, but high-speed data stream mining.
  • #4: Knowledge discovery systems are constrained by three main limited resources: time, memory and sample size. In traditional applications of machine learning and statistics, sample size tends to be the dominant limitation. In contrast, in many (if not most) present-day data mining applications, the bottleneck is time and memory, not examples. The latter are typically in over-supply, in the sense that it is impossible with current KDD systems to make use of all of them within the available computational resources. Currently, the most efficient algorithms available (e.g., SPRINT or BIRCH) concentrate on making it possible to mine databases that do not fit in main memory by only requiring sequential scans of the disk. But even these algorithms have only been tested on up to a few million examples. Ideally, we would like to have KDD systems that operate continuously and indefinitely, incorporating examples as they arrive, and never losing potentially valuable information. Incremental algorithms are out there, but they are either highly sensitive to example ordering, potentially never recovering from an unfavorable set of early examples, or produce results similar to batch classification with undesired overhead in computation time.
• #5: Introducing: VFDT, a decision-tree learning system that overcomes the shortcomings of incremental algorithms. It is I/O bound, which means it mines examples in less time than it takes to input them from disk; it is an anytime algorithm, meaning the model is ready to use at any time; and it does not store any examples, learning by seeing each of them exactly once.
  • #7: Hoeffding Trees are born from the limitations of classical decision tree learners, which assume all training data can be simultaneously stored in main memory. HT is based on the assumption that, in order to find the best attribute at a node, it may be sufficient to consider only a small subset of the training examples that pass through that node. Given a stream of examples, the first ones will be used to choose the root test; once the root attribute is chosen, the succeeding examples will be passed down to the corresponding leaves and used to choose the appropriate attributes there, and so on recursively. We solve the difficult problem of deciding exactly how many examples are necessary at each node by using a statistical result known as the Hoeffding bound.
  • #8: So, how do we decide how many examples are enough?
  • #10: If HTδ is the tree produced by the Hoeffding tree algorithm with desired probability δ given infinite examples (Table 1), DT∗ is the asymptotic batch tree, and p is the leaf probability, then E[∆i(HTδ, DT∗)] ≤ δ/p. The smaller δ/p , the more similar the Hoeffding tree is to a subtree of the asymptotic batch tree.
• #12: The Hoeffding tree algorithm was implemented into the Very Fast Decision Tree learner (VFDT), which includes some enhancements for practical use. In case of ties, potentially many examples would be required to decide between the tied attributes with some confidence, which is wasteful since they are essentially equivalent; VFDT simply splits on the current best attribute. Recomputing G is actually pretty expensive, so VFDT lets the user define a parameter for the minimum number of examples read before recomputing G. Memory was an issue for HT: the more the tree grew, the more memory it needed. VFDT deactivates the least promising leaves, keeping track only of the probability of x falling into leaf l, times the observed error rate.