Mining Adaptively Frequent Closed Unlabeled
       Rooted Trees in Data Streams

            Albert Bifet and Ricard Gavaldà

               Universitat Politècnica de Catalunya


14th ACM SIGKDD International Conference on Knowledge
         Discovery and Data Mining (KDD’08)
                2008 Las Vegas, USA
Tree Mining
   Mining frequent trees is becoming an important task
   Applications:
       chemical informatics
       computer vision
       text retrieval
       bioinformatics
       Web analysis: many link-based structures may be
       studied formally by means of unordered trees

Data Streams
   Sequence is potentially infinite
   High amount of data: sublinear space
   High speed of arrival: sublinear time per example
Introduction: Data Streams

   Data Streams
      Sequence is potentially infinite
      High amount of data: sublinear space
      High speed of arrival: sublinear time per example
      Once an element from a data stream has been processed,
      it is discarded or archived

   Example
   Puzzle: Finding Missing Numbers
       Let π be a permutation of {1, . . . , n}.
       Let π−1 be π with one element missing.
       π−1[i] arrives in increasing order
   Task: Determine the missing number
       Naive solution: use an n-bit vector to memorize all the
       numbers seen (O(n) space)
       Data stream solution: O(log n) space. Store
           n(n + 1)/2 − ∑j≤i π−1[j]
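The O(log n) trick above can be sketched in a few lines of Python (a minimal illustration, not code from the paper):

```python
# Streaming solution to the missing-number puzzle: instead of an
# n-bit vector (O(n) space), keep only a running sum, which fits in
# O(log n) bits. The sum works whether or not items arrive in order.

def find_missing(stream, n):
    """Return the element missing from a permutation of {1..n}
    seen as a one-pass stream."""
    remainder = n * (n + 1) // 2   # sum of 1..n
    for x in stream:               # each item is read once, then discarded
        remainder -= x
    return remainder               # whatever was never subtracted is missing

print(find_missing(iter([1, 2, 4, 5]), 5))  # -> 3
```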
Introduction: Trees


  Our trees are:                   Our subtrees are:
      Unlabeled                         Induced
      Ordered and Unordered
                    Two different ordered trees
                   but the same unordered tree
Introduction


      Induced subtrees: obtained by repeatedly removing leaf
      nodes




      Embedded subtrees: obtained by contracting some of the
      edges
Introduction

   What Is Tree Pattern Mining?

   Given a dataset of trees, find the complete set of frequent
   subtrees
       Frequent Tree Pattern (FS):
            Include all the trees whose support is no less than min_sup
       Closed Frequent Tree Pattern (CS):
            Include no tree which has a super-tree with the same
            support
       CS ⊆ FS
       Closed Frequent Tree Mining provides a compact
       representation of frequent trees without loss of information
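The FS/CS distinction can be illustrated with a toy script. Here strings and the substring relation stand in for trees and the subtree relation; this is an illustration of the definitions, not the paper's algorithm:

```python
# Frequent vs. closed patterns on a toy dataset. Strings play the role
# of trees and "substring" plays the role of the subtree relation.

def support(p, dataset):
    return sum(1 for t in dataset if p in t)

def frequent_and_closed(candidates, dataset, min_sup):
    fs = {p: support(p, dataset) for p in candidates
          if support(p, dataset) >= min_sup}
    # closed: no proper super-pattern with exactly the same support
    cs = {p: s for p, s in fs.items()
          if not any(p != q and p in q and fs[q] == s for q in fs)}
    return fs, cs

data = ["abc", "abd", "ab"]
fs, cs = frequent_and_closed(["a", "b", "ab", "abc"], data, min_sup=2)
print(sorted(fs), sorted(cs))  # -> ['a', 'ab', 'b'] ['ab']
```

Note that CS ⊆ FS holds by construction, and the closed pattern "ab" alone determines the supports of its sub-patterns.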
Introduction

   Unordered Subtree Mining
    [Tree figures A, B, X, Y omitted]

                          D = {A, B}, min_sup = 2

                           # Closed Subtrees: 2
                          # Frequent Subtrees: 9

         Closed Subtrees: X, Y

         Frequent Subtrees: [figure omitted]
Introduction


   Problem
   Given a data stream D of rooted, unlabeled and unordered
   trees, find the frequent closed trees.

   We provide three algorithms, of increasing power:
       Incremental
       Sliding Window
       Adaptive
Outline


   1   Introduction


   2   Data Streams


   3   ADWIN : Concept Drift Mining


   4   Adaptive Closed Frequent Tree Mining


   5   Summary
Data Streams



  Data Streams
  At any time t in the data stream, we would like the per-item
  processing time and storage to be simultaneously O(log^k (N · t)).

  Approximation algorithms
      Small error rate with high probability
      An algorithm (ε, δ)-approximates F if it outputs F̃ for
      which Pr[|F̃ − F| > εF] < δ.
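As a concrete instance of such a guarantee, the standard Hoeffding bound gives the number of samples needed to (ε, δ)-approximate a bounded mean. This is the simpler additive-error variant, not the relative-error form εF used above; the function name is illustrative:

```python
# Sample size for an additive (eps, delta)-approximation of a mean in
# [0, 1], from the Hoeffding bound: m >= ln(2/delta) / (2 * eps**2),
# so that Pr[|estimate - mean| > eps] < delta.

import math

def hoeffding_samples(eps, delta):
    """Samples sufficient for an additive (eps, delta) guarantee."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(hoeffding_samples(0.1, 0.05))  # -> 185
```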
Data Streams Approximation Algorithms



            1011000111 1010101

  Sliding Window
  We can maintain simple statistics over sliding windows, using
  O((1/ε) log² N) space, where
      N is the length of the sliding window
      ε is the accuracy parameter

      M. Datar, A. Gionis, P. Indyk, and R. Motwani.
      Maintaining stream statistics over sliding windows. 2002
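A simplified sketch of the exponential-histogram idea behind the Datar et al. result: the window is summarized by buckets of exponentially growing sizes, with at most k buckets per size. The class name and parameter k are illustrative, and this is a reduced variant, not the paper's data structure:

```python
# Approximate count of 1's in a sliding window of length N using
# buckets of exponentially growing sizes; with ~1/eps buckets per
# size, space is O((1/eps) log^2 N) bits overall.

class ExpHistogram:
    def __init__(self, window, k=2):
        self.window = window      # N, sliding-window length
        self.k = k                # max buckets per size (~1/eps)
        self.buckets = []         # (timestamp, size) pairs, newest first
        self.time = 0

    def add(self, bit):
        self.time += 1
        # expire the oldest bucket once it leaves the window
        if self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if bit == 1:
            self.buckets.insert(0, (self.time, 1))
            # merge the two oldest buckets whenever a size overflows
            size = 1
            while sum(1 for _, s in self.buckets if s == size) > self.k:
                idx = [i for i, (_, s) in enumerate(self.buckets) if s == size]
                i, j = idx[-2], idx[-1]            # two oldest of that size
                self.buckets[i] = (self.buckets[i][0], 2 * size)
                del self.buckets[j]
                size *= 2

    def count(self):
        # every bucket counted fully except the oldest, counted at half
        if not self.buckets:
            return 0
        total = sum(s for _, s in self.buckets)
        return total - self.buckets[-1][1] // 2
```

Only the oldest bucket's contents are uncertain, which is what bounds the relative error of `count()`.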
ADWIN: Adaptive sliding window

  ADWIN
  An adaptive sliding window whose size is recomputed online
  according to the rate of change observed.

  ADWIN has rigorous guarantees (theorems)
      On ratio of false positives and negatives
      On the relation of the size of the current window and
      change rates

  ADWIN using a Data Stream Sliding Window Model,
      can provide the exact counts of 1’s in O(1) time per point.
      tries O(log W ) cutpoints
      uses O((1/ε) log W ) memory words
      the processing time per example is O(log W ) (amortized
      and worst-case).
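The cut-testing idea can be sketched as an uncompressed toy version (the function name, the delta parameter, and the exact Hoeffding-style threshold are illustrative; the real ADWIN stores the window in O((1/ε) log W) buckets and tries only O(log W) cutpoints, which is what gives the bounds above):

```python
# Toy ADWIN-style change detector: keep recent values in a list and,
# after each arrival, try every cutpoint; if the two subwindows have
# means differing by more than a Hoeffding-style bound eps_cut, drop
# the stale prefix. O(W^2) per step, unlike the real bucketed ADWIN.

import math

def adwin_step(window, x, delta=0.01):
    """Append x, then shrink the window while some split signals change."""
    window.append(x)
    changed = True
    while changed and len(window) >= 2:
        changed = False
        for i in range(1, len(window)):
            n0, n1 = i, len(window) - i
            mu0 = sum(window[:i]) / n0
            mu1 = sum(window[i:]) / n1
            m = 1 / (1 / n0 + 1 / n1)          # harmonic mean of sizes
            eps_cut = math.sqrt(math.log(4 * len(window) / delta) / (2 * m))
            if abs(mu0 - mu1) > eps_cut:
                del window[:i]                  # drop the older subwindow
                changed = True
                break
    return window
```

Fed 60 zeros and then 60 ones, the window shrinks shortly after the change and ends up holding (almost) only post-change values.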
Time Change Detectors and Predictors: A General
Framework

   [Diagram: the input xt feeds an Estimator, which outputs an
   Estimation; the Estimator feeds a Change Detector, which
   outputs an Alarm; a Memory module interacts with both the
   Estimator and the Change Detector]
Window Management Models


                      W = 101010110111111

Equal & fixed size              Total window against
subwindows                     subwindow
      1010 1011011 1111               10101011011 1111
[Kifer+ 04]                    [Gama+ 04]

Equal size adjacent            ADWIN: All Adjacent
subwindows                     subwindows
     1010101 1011     1111           1 01010110111111
[Dasu+ 06]
Pattern Relaxed Support


     Guojie Song, Dongqing Yang, Bin Cui, Baihua Zheng,
     Yunfeng Liu and Kunqing Xie.
     CLAIM: An Efficient Method for Relaxed Frequent Closed
     Itemsets Mining over Stream Data
       Linear Relaxed Interval: the support space of all
       subpatterns can be divided into n = 1/εr intervals, where
       εr is a user-specified relaxed factor, and each interval can
       be denoted by Ii = [li , ui ), where li = (n − i) · εr ≥ 0,
       ui = (n − i + 1) · εr ≤ 1 and i ≤ n.
       Linear Relaxed closed subpattern t: if and only if there
       exists no proper superpattern t′ of t such that their
       supports belong to the same interval Ii .
Pattern Relaxed Support



  As the number of closed frequent patterns is not linear with
  respect to support, we introduce a new relaxed support:
      Logarithmic Relaxed Interval: the support space of all
      subpatterns can be divided into n = 1/εr intervals, where
      εr is a user-specified relaxed factor, and each interval can
      be denoted by Ii = [li , ui ), where li = c^i , ui = c^(i+1) − 1
      and i ≤ n.
      Logarithmic Relaxed closed subpattern t: if and only if
      there exists no proper superpattern t′ of t such that their
      supports belong to the same interval Ii .
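The interval test reduces to comparing interval indices, as in this short sketch (function names are illustrative; integer arithmetic avoids floating-point rounding in the logarithm):

```python
# Logarithmic relaxed intervals I_i = [c**i, c**(i+1) - 1]: two
# supports are equivalent for relaxed closedness when they share i.

def log_interval(support, c=2):
    """Index i with c**i <= support < c**(i+1)."""
    assert support >= 1
    i = 0
    while c ** (i + 1) <= support:
        i += 1
    return i

def same_interval(s1, s2, c=2):
    """True if s1 and s2 fall in the same logarithmic interval."""
    return log_interval(s1, c) == log_interval(s2, c)

# with c = 2, supports 4..7 all map to interval 2, so a super-pattern
# with support 4 "absorbs" a pattern with support 7 under the relaxation
print(log_interval(7), log_interval(8))  # -> 2 3
```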
Galois Lattice of closed set of trees

   [Lattice diagram over a dataset D of three trees: nodes 1, 2, 3;
   12, 13, 23; 123]

   We need
       a Galois connection pair
       a closure operator
Incremental mining on closed frequent trees

    1   Adding a tree transaction does not decrease the
        number of closed trees for D.
    2   Adding a transaction with a closed tree does not
        modify the number of closed trees for D.
Sliding Window mining on closed frequent trees

    1   Deleting a tree transaction does not increase the
        number of closed trees for D.
    2   Deleting a tree transaction that is repeated does not
        modify the number of closed trees for D.
Algorithms

  Algorithms
      Incremental: I NC T REE N AT
      Sliding Window: W IN T REE N AT
      Adaptive: A DAT REE N AT Uses ADWIN to monitor change

  ADWIN
  An adaptive sliding window whose size is recomputed online
  according to the rate of change observed.

  ADWIN has rigorous guarantees (theorems)
      On ratio of false positives and negatives
      On the relation of the size of the current window and
      change rates
Experimental Validation: TN1

   [Plot: running time (sec.) vs dataset size (millions of trees),
   comparing CMTreeMiner and I NC T REE N AT]

     Figure: Time on experiments on ordered trees on TN1 dataset
Experimental Validation

   [Plot: number of closed trees vs number of samples, for
   AdaTreeInc 1 and AdaTreeInc 2]

   Figure: Number of closed trees maintaining the same number of
   closed datasets on input data
Summary



  Conclusions
      New logarithmic relaxed closed support
      Using Galois Lattice Theory, we present methods for mining
      closed trees
          Incremental: I NC T REE N AT
          Sliding Window: W IN T REE N AT
          Adaptive: A DAT REE N AT using ADWIN to monitor change

  Future Work
  Labeled Trees and XML data.

More Related Content

PDF
A Short Course in Data Stream Mining
PDF
Mining Frequent Closed Graphs on Evolving Data Streams
PDF
Internet of Things Data Science
PDF
Real-Time Big Data Stream Analytics
PPTX
STRIP: stream learning of influence probabilities.
PDF
Efficient Online Evaluation of Big Data Stream Classifiers
PPTX
Scaling Python to CPUs and GPUs
PDF
PyCon Estonia 2019
A Short Course in Data Stream Mining
Mining Frequent Closed Graphs on Evolving Data Streams
Internet of Things Data Science
Real-Time Big Data Stream Analytics
STRIP: stream learning of influence probabilities.
Efficient Online Evaluation of Big Data Stream Classifiers
Scaling Python to CPUs and GPUs
PyCon Estonia 2019

What's hot (20)

PDF
Introduction to Deep Learning with Python
PDF
Text classification in scikit-learn
PDF
Deep Learning through Examples
PPTX
What is TensorFlow? | Introduction to TensorFlow | TensorFlow Tutorial For Be...
PDF
[241]large scale search with polysemous codes
PDF
N-gram IDF: A Global Term Weighting Scheme Based on Information Distance (WWW...
PDF
Codes and Isogenies
PPTX
Scalable membership management
PDF
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
PDF
Alex Tellez, Deep Learning Applications
PDF
Introduction to Neural Networks in Tensorflow
PDF
Keynote at Converge 2019
PPTX
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
PDF
Latent Semantic Analysis of Wikipedia with Spark
PPTX
Diving into Deep Learning (Silicon Valley Code Camp 2017)
PDF
Array computing and the evolution of SciPy, NumPy, and PyData
PDF
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
PDF
Data Science and Machine Learning Using Python and Scikit-learn
PDF
Bayesian Counters
PDF
Sea Amsterdam 2014 November 19
Introduction to Deep Learning with Python
Text classification in scikit-learn
Deep Learning through Examples
What is TensorFlow? | Introduction to TensorFlow | TensorFlow Tutorial For Be...
[241]large scale search with polysemous codes
N-gram IDF: A Global Term Weighting Scheme Based on Information Distance (WWW...
Codes and Isogenies
Scalable membership management
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Alex Tellez, Deep Learning Applications
Introduction to Neural Networks in Tensorflow
Keynote at Converge 2019
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
Latent Semantic Analysis of Wikipedia with Spark
Diving into Deep Learning (Silicon Valley Code Camp 2017)
Array computing and the evolution of SciPy, NumPy, and PyData
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Data Science and Machine Learning Using Python and Scikit-learn
Bayesian Counters
Sea Amsterdam 2014 November 19
Ad

Viewers also liked (20)

PDF
Introduction to Big Data
PDF
Mining Implications from Lattices of Closed Trees
PDF
Postgraduate Studies: Graduate School Experience
PDF
@Travis pm. presents 'what is klout?'
PDF
Introduction to Big Data Science
PDF
Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos.
PDF
Adaptive XML Tree Mining on Evolving Data Streams
PDF
MOA : Massive Online Analysis
PDF
PAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and Solutions
PDF
New ensemble methods for evolving data streams
PDF
Sentiment Knowledge Discovery in Twitter Streaming Data
PDF
Leveraging Bagging for Evolving Data Streams
PDF
Understanding the Effects of Streamlining the Orchestration of Learning Activ...
PDF
Kalman Filters and Adaptive Windows for Learning in Data Streams
PDF
Ad hoc vs. organised orchestration: A comparative analysis of technology-driv...
PDF
Distributed Decision Tree Learning for Mining Big Data Streams
PDF
Apache Samoa: Mining Big Data Streams with Apache Flink
PPTX
High Availability in YARN
PDF
Moa: Real Time Analytics for Data Streams
PPTX
Data warehouse
Introduction to Big Data
Mining Implications from Lattices of Closed Trees
Postgraduate Studies: Graduate School Experience
@Travis pm. presents 'what is klout?'
Introduction to Big Data Science
Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos.
Adaptive XML Tree Mining on Evolving Data Streams
MOA : Massive Online Analysis
PAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and Solutions
New ensemble methods for evolving data streams
Sentiment Knowledge Discovery in Twitter Streaming Data
Leveraging Bagging for Evolving Data Streams
Understanding the Effects of Streamlining the Orchestration of Learning Activ...
Kalman Filters and Adaptive Windows for Learning in Data Streams
Ad hoc vs. organised orchestration: A comparative analysis of technology-driv...
Distributed Decision Tree Learning for Mining Big Data Streams
Apache Samoa: Mining Big Data Streams with Apache Flink
High Availability in YARN
Moa: Real Time Analytics for Data Streams
Data warehouse
Ad

Similar to Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams (20)

PDF
18 Data Streams
PPTX
DeepFak.pptx asdasdasdasdasdasdasdasdasd
PDF
Lecture 7: Recurrent Neural Networks
PPTX
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
PPT
5.3 dyn algo-i
PPTX
Java and Deep Learning (Introduction)
PPT
5.1 mining data streams
PPTX
Publishing consuming Linked Sensor Data meetup Cuenca
PPT
Classification: Decision Trees , random Forest.ppt
PPTX
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
PPT
Evaluating Classification Algorithms Applied To Data Streams Esteban Donato
PDF
Deep Learning Based Voice Activity Detection and Speech Enhancement
PPT
19 algorithms-and-complexity-110627100203-phpapp02
KEY
Defense
PPTX
Spectral-, source-, connectivity- and network analysis of EEG and MEG data
PDF
Time Series Data with Apache Cassandra
ODP
EOS5 Demo
ODP
End of Sprint 5
PDF
Higher Order Fused Regularization for Supervised Learning with Grouped Parame...
PPT
14889574 dl ml RNN Deeplearning MMMm.ppt
18 Data Streams
DeepFak.pptx asdasdasdasdasdasdasdasdasd
Lecture 7: Recurrent Neural Networks
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
5.3 dyn algo-i
Java and Deep Learning (Introduction)
5.1 mining data streams
Publishing consuming Linked Sensor Data meetup Cuenca
Classification: Decision Trees , random Forest.ppt
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
Evaluating Classification Algorithms Applied To Data Streams Esteban Donato
Deep Learning Based Voice Activity Detection and Speech Enhancement
19 algorithms-and-complexity-110627100203-phpapp02
Defense
Spectral-, source-, connectivity- and network analysis of EEG and MEG data
Time Series Data with Apache Cassandra
EOS5 Demo
End of Sprint 5
Higher Order Fused Regularization for Supervised Learning with Grouped Parame...
14889574 dl ml RNN Deeplearning MMMm.ppt

More from Albert Bifet (11)

PDF
Artificial intelligence and data stream mining
PDF
MOA for the IoT at ACML 2016
PDF
Mining Big Data Streams with APACHE SAMOA
PDF
Real Time Big Data Management
PDF
Multi-label Classification with Meta-labels
PDF
Pitfalls in benchmarking data stream classification and how to avoid them
PDF
Efficient Data Stream Classification via Probabilistic Adaptive Windows
PPTX
Mining Big Data in Real Time
PDF
Mining Big Data in Real Time
PDF
Fast Perceptron Decision Tree Learning from Evolving Data Streams
PDF
Adaptive Learning and Mining for Data Streams and Frequent Patterns
Artificial intelligence and data stream mining
MOA for the IoT at ACML 2016
Mining Big Data Streams with APACHE SAMOA
Real Time Big Data Management
Multi-label Classification with Meta-labels
Pitfalls in benchmarking data stream classification and how to avoid them
Efficient Data Stream Classification via Probabilistic Adaptive Windows
Mining Big Data in Real Time
Mining Big Data in Real Time
Fast Perceptron Decision Tree Learning from Evolving Data Streams
Adaptive Learning and Mining for Data Streams and Frequent Patterns

Recently uploaded (20)

PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
Network Security Unit 5.pdf for BCA BBA.
PPTX
sap open course for s4hana steps from ECC to s4
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
Review of recent advances in non-invasive hemoglobin estimation
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PPT
Teaching material agriculture food technology
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
Machine learning based COVID-19 study performance prediction
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PPTX
Spectroscopy.pptx food analysis technology
The Rise and Fall of 3GPP – Time for a Sabbatical?
Network Security Unit 5.pdf for BCA BBA.
sap open course for s4hana steps from ECC to s4
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Review of recent advances in non-invasive hemoglobin estimation
MYSQL Presentation for SQL database connectivity
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Teaching material agriculture food technology
MIND Revenue Release Quarter 2 2025 Press Release
20250228 LYD VKU AI Blended-Learning.pptx
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Dropbox Q2 2025 Financial Results & Investor Presentation
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Mobile App Security Testing_ A Comprehensive Guide.pdf
NewMind AI Weekly Chronicles - August'25 Week I
Machine learning based COVID-19 study performance prediction
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
Spectroscopy.pptx food analysis technology

Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

  • 1. Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams Albert Bifet and Ricard Gavaldà Universitat Politècnica de Catalunya 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’08) 2008 Las Vegas, USA
  • 2. Tree Mining Mining frequent trees is becoming an important task Applications: chemical informatics computer vision text retrieval bioinformatics Data Streams Web analysis. Sequence is potentially Many link-based infinite structures may be High amount of data: studied formally by sublinear space means of unordered High speed of arrival: trees sublinear time per example
  • 3–6. Introduction: Data Streams Data Streams: the sequence is potentially infinite; high amount of data: sublinear space; high speed of arrival: sublinear time per example; once an element from a data stream has been processed, it is discarded or archived. Example Puzzle: Finding Missing Numbers. Let π be a permutation of {1, . . . , n}, and let π−1 be π with one element missing; π−1[i] arrives in increasing order. Task: determine the missing number. A naive solution uses an n-bit vector to memorize all the numbers seen (O(n) space). A data stream solution uses O(log(n)) space: store the running value n(n + 1)/2 − ∑_{j≤i} π−1[j]; when the stream ends, this value is the missing number.
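The O(log n)-space solution on this slide fits in a few lines (a minimal sketch; the function name and list encoding of the stream are illustrative):

```python
def find_missing(stream, n):
    """Recover the missing element of a permutation of {1, ..., n}
    while storing only one running value: O(log n) bits of state."""
    remainder = n * (n + 1) // 2   # sum of the complete permutation
    for x in stream:               # subtract each arriving element once
        remainder -= x
    return remainder               # what is left is the missing number

print(find_missing([5, 1, 4, 2], n=5))  # → 3
```

Note that the running-sum trick does not even need the slide's assumption that elements arrive in increasing order.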
  • 7. Introduction: Trees Our trees are: unlabeled, ordered and unordered. Our subtrees are: induced. (Figure: two different ordered trees that are the same unordered tree.)
  • 8. Introduction Induced subtrees: obtained by repeatedly removing leaf nodes Embedded subtrees: obtained by contracting some of the edges
  • 9. Introduction What Is Tree Pattern Mining? Given a dataset of trees, find the complete set of frequent subtrees. Frequent Tree Pattern (FS): includes all the trees whose support is no less than min_sup. Closed Frequent Tree Pattern (CS): includes no tree which has a super-tree with the same support. CS ⊆ FS. Closed frequent tree mining provides a compact representation of frequent trees without loss of information.
  • 10. Introduction Unordered Subtree Mining (Figure: example trees A, B and subtrees X, Y.) D = {A, B}, min_sup = 2. # Closed Subtrees: 2 (namely X, Y). # Frequent Subtrees: 9 (shown in the figure).
  • 11. Introduction Problem: given a data stream D of rooted, unlabeled and unordered trees, find the frequent closed trees. We provide three algorithms of increasing power: Incremental, Sliding Window, and Adaptive.
  • 12. Outline 1 Introduction 2 Data Streams 3 ADWIN : Concept Drift Mining 4 Adaptive Closed Frequent Tree Mining 5 Summary
  • 13. Data Streams At any time t in the data stream, we would like the per-item processing time and storage to be simultaneously O(log^k (N, t)), i.e., polylogarithmic. Approximation algorithms: small error rate with high probability. An algorithm (ε, δ)-approximates F if it outputs F̃ for which Pr[|F̃ − F| > εF] < δ.
  • 14–19. Data Streams Approximation Algorithms (Figure: a sliding window advancing one bit at a time over a binary stream.) Sliding Window: we can maintain simple statistics over sliding windows, using O((1/ε) log² N) space, where N is the length of the sliding window and ε is the accuracy parameter. M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. 2002.
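The Datar–Gionis–Indyk–Motwani bound cited on this slide rests on exponential histograms. A condensed sketch of that technique, under simplifying assumptions (at most two buckets per size, estimate = total bucket mass minus half the oldest bucket; class and method names are mine):

```python
class ExpHistogram:
    """Sketch of an exponential histogram: approximately count the 1s
    among the last N bits of a stream using O(log^2 N) space."""

    def __init__(self, N):
        self.N = N
        self.t = 0
        self.buckets = []  # (timestamp of bucket's most recent 1, size), newest first

    def update(self, bit):
        self.t += 1
        # Expire buckets whose most recent 1 has left the window.
        while self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit:
            self.buckets.insert(0, (self.t, 1))
            # Keep at most two buckets per size: when a third appears,
            # merge the two oldest of that size into one of double size.
            i = 0
            while i + 2 < len(self.buckets):
                if self.buckets[i][1] == self.buckets[i + 2][1]:
                    merged = (self.buckets[i + 1][0], self.buckets[i + 1][1] * 2)
                    self.buckets[i + 1:i + 3] = [merged]
                else:
                    i += 1

    def count(self):
        """Estimate: total bucket mass minus half of the (possibly
        partially expired) oldest bucket."""
        if not self.buckets:
            return 0
        return sum(s for _, s in self.buckets) - self.buckets[-1][1] // 2
```

Because bucket sizes grow geometrically, only O(log N) buckets exist and the error is bounded by half the oldest bucket, which is the source of the accuracy parameter ε in the slide's space bound.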
  • 20. Outline 1 Introduction 2 Data Streams 3 ADWIN : Concept Drift Mining 4 Adaptive Closed Frequent Tree Mining 5 Summary
  • 21. ADWIN: Adaptive sliding window ADWIN is an adaptive sliding window whose size is recomputed online according to the rate of change observed. ADWIN has rigorous guarantees (theorems): on the ratio of false positives and negatives, and on the relation between the size of the current window and change rates. Using a Data Stream Sliding Window Model, ADWIN can provide the exact counts of 1’s in O(1) time per point; it tries O(log W) cutpoints, uses O((1/ε) log W) memory words, and its processing time per example is O(log W) (amortized and worst-case).
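A toy sketch of the ADWIN cut rule described above, under stated simplifications (brute force over every cutpoint rather than the paper's O(log W) bucket boundaries, and a simplified Hoeffding-style threshold; `adwin_sketch` is a hypothetical name, not the authors' implementation):

```python
import math
from collections import deque

def adwin_sketch(stream, delta=0.01):
    """Shrink the window whenever some split into two adjacent
    subwindows shows averages differing by more than eps_cut."""
    window = deque()
    for x in stream:
        window.append(x)
        shrunk = True
        while shrunk and len(window) > 1:
            shrunk = False
            w = list(window)
            for split in range(1, len(w)):
                w0, w1 = w[:split], w[split:]
                m = 1.0 / (1.0 / len(w0) + 1.0 / len(w1))  # harmonic mean of sizes
                eps_cut = math.sqrt(math.log(4 * len(w) / delta) / (2 * m))
                if abs(sum(w0) / len(w0) - sum(w1) / len(w1)) >= eps_cut:
                    for _ in range(split):      # drop the older subwindow
                        window.popleft()
                    shrunk = True
                    break
    return list(window)

# After a change from all 0s to all 1s, the stale prefix is dropped.
print(set(adwin_sketch([0] * 100 + [1] * 100)))
```

The window size is never chosen by the user: it grows while the data looks stationary and collapses as soon as two subwindows are distinguishable, which is exactly the adaptivity the slide claims.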
  • 22–24. Time Change Detectors and Predictors: A General Framework (Figure: the input xt feeds an Estimator that outputs an Estimation; a Change Detector watches the estimator and raises an Alarm; a Memory module supports both.)
  • 25–38. Window Management Models W = 101010110111111. Equal & fixed-size subwindows [Kifer+ 04]; total window against subwindow [Gama+ 04]; equal-size adjacent subwindows [Dasu+ 06]; ADWIN: all adjacent subwindows (the figure steps through every split of W into two adjacent subwindows).
  • 39. Outline 1 Introduction 2 Data Streams 3 ADWIN : Concept Drift Mining 4 Adaptive Closed Frequent Tree Mining 5 Summary
  • 40. Pattern Relaxed Support Guojie Song, Dongqing Yang, Bin Cui, Baihua Zheng, Yunfeng Liu and Kunqing Xie. CLAIM: An Efficient Method for Relaxed Frequent Closed Itemsets Mining over Stream Data. Linear Relaxed Interval: the support space of all subpatterns can be divided into n = 1/εr intervals, where εr is a user-specified relaxation factor, and each interval can be denoted by Ii = [li, ui), where li = (n − i) · εr ≥ 0, ui = (n − i + 1) · εr ≤ 1 and i ≤ n. Linear Relaxed closed subpattern t: if and only if there exists no proper superpattern t′ of t such that their supports belong to the same interval Ii.
  • 41. Pattern Relaxed Support As the number of closed frequent patterns is not linear with respect to support, we introduce a new relaxed support. Logarithmic Relaxed Interval: the support space of all subpatterns can be divided into n = 1/εr intervals, where εr is a user-specified relaxation factor, and each interval can be denoted by Ii = [li, ui), where li = c^i, ui = c^(i+1) − 1 and i ≤ n. Logarithmic Relaxed closed subpattern t: if and only if there exists no proper superpattern t′ of t such that their supports belong to the same interval Ii.
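The interval definition above amounts to bucketing supports by their integer logarithm, which a tiny helper makes concrete (a hypothetical helper; `c` plays the role of the base fixed by the relaxation factor):

```python
def log_interval(support, c=2):
    """Index i of the logarithmic relaxed interval [c^i, c^(i+1) - 1]
    containing a support value, via integer arithmetic."""
    i = 0
    while c ** (i + 1) <= support:
        i += 1
    return i

# Supports 4..7 all fall in interval 2 for c = 2, so two patterns with
# supports 5 and 6 count as having "the same" support.
print([log_interval(s) for s in [1, 2, 3, 4, 7, 8]])  # → [0, 1, 1, 2, 2, 3]
```

Because the buckets widen geometrically, high-support patterns tolerate larger absolute support differences before being reported as separate closed patterns, matching the slide's motivation that the number of closed patterns is not linear in the support.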
  • 42. Galois Lattice of closed sets of trees (Figure: the lattice of closed sets 1, 2, 3, 12, 13, 23, 123 over dataset D.) We need: a Galois connection pair and a closure operator.
  • 43. Incremental mining on closed frequent trees 1. Adding a tree transaction does not decrease the number of closed trees for D. 2. Adding a transaction with a closed tree does not modify the number of closed trees for D. (Figure: the Galois lattice 1, 2, 3, 12, 13, 23, 123.)
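Both lattice properties on this slide can be checked on a toy example, using itemsets as a stand-in for trees (the closed patterns of a dataset are exactly the intersections of nonempty subsets of transactions; the dataset is mine, for illustration):

```python
from itertools import combinations

def closed_patterns(D):
    """All closed patterns of D: every intersection of a nonempty
    subset of transactions (the Galois closure on itemsets)."""
    closed = set()
    for k in range(1, len(D) + 1):
        for subset in combinations(D, k):
            closed.add(frozenset.intersection(*subset))
    return closed

D = [frozenset('ab'), frozenset('ac')]
base = closed_patterns(D)
with_new = closed_patterns(D + [frozenset('bc')])    # arbitrary new transaction
with_closed = closed_patterns(D + [frozenset('a')])  # 'a' is already closed in D

print(len(base), len(with_new), len(with_closed))  # → 3 7 3
```

Property 1 shows up as `len(with_new) >= len(base)` and property 2 as `len(with_closed) == len(base)`; these monotonicity facts are what let the incremental and sliding-window algorithms maintain the closed set without recomputing it from scratch.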
  • 44. Sliding Window mining on closed frequent trees 1. Deleting a tree transaction does not increase the number of closed trees for D. 2. Deleting a tree transaction that is repeated does not modify the number of closed trees for D. (Figure: the Galois lattice 1, 2, 3, 12, 13, 23, 123.)
  • 45. Algorithms Incremental: IncTreeNat. Sliding Window: WinTreeNat. Adaptive: AdaTreeNat, which uses ADWIN to monitor change. ADWIN: an adaptive sliding window whose size is recomputed online according to the rate of change observed; ADWIN has rigorous guarantees (theorems) on the ratio of false positives and negatives, and on the relation between the size of the current window and change rates.
  • 46. Experimental Validation: TN1 (Figure: time in seconds versus dataset size in millions, comparing CMTreeMiner against IncTreeNat on ordered trees on the TN1 dataset.)
  • 47. Experimental Validation (Figure: number of closed trees versus number of samples for AdaTreeInc 1 and AdaTreeInc 2, maintaining the same number of closed trees on the input data.)
  • 48. Outline 1 Introduction 2 Data Streams 3 ADWIN : Concept Drift Mining 4 Adaptive Closed Frequent Tree Mining 5 Summary
  • 49. Summary Conclusions: a new logarithmic relaxed closed support; using Galois lattice theory, we present methods for mining closed trees: Incremental (IncTreeNat), Sliding Window (WinTreeNat), and Adaptive (AdaTreeNat, using ADWIN to monitor change). Future Work: labeled trees and XML data.