SlideShare a Scribd company logo
iSAX 2.0 INDEXING & MINING
 ONE BILLION TIME SERIES

          Paper By:
      Alessandro Camerra
       Themis Palpanas
           Jin Shieh
        Eamonn Keogh
iSAX 2.0 INDEXING & MINING
 ONE BILLION TIME SERIES

       Presented by:
         Vasu Jain
Contents
1. Introduction

2. Preliminaries
    a. The SAX Representation
    b. The iSAX Representation
    c. Indexing iSAX

3. iSAX 2 Index
    a. Bulk Loading
    b. Node Splitting Policy

4. Experimental Evaluations

5. Conclusion & Future work

6. References



                                 CONTENTS | 2
Time Series

Sequence of data points measured at successive time instants spaced
at uniform time intervals.

Time Series T = t1, t2, t3, t4 …… tn
    n real valued observations at equal intervals of time

e.g. Stock Prices, Acid precipitation data at a single location




                                                                  INTRODUCTION | 3
Time Series Examples
                                       Historical archives
    Sensors            DNA sequences




                                              Motion data

 Stocks                ECG
 data




                                               INTRODUCTION | 4
Introduction
•   Indexing and Mining time series is hot. There is a pressing need for
    indexing and mining Time series data

•   Time series of order 100 of Millions to Billions

“…we have about a million samples per minute coming in from 1000 gas
turbines around the world… we need to be able to do similarity search for...”
Lane Desborough, GE.
 “…an archival rate of 3.6 billion points a day, how can we (do similarity
search) in this data?” Josh Patterson, TVA.

•   Two bottlenecks while mining these massive Time Series
     • Time complexity of building the index
         (6 days to index 100 million objects)
     • Time to retrieve data from the disk



                                                                         INTRODUCTION | 5
Introduction
•   iSAX 2.0
     • To solve these problem
     • A data structure which is an extension to iSAX
     • Suitable for indexing and mining very large time series.

•   Novel mechanism for scalable indexing of time series: Bulk loading
    algorithm, node splitting policies.

•   Results:
     •   Index building time reduced by 72% with bulk loading scheme
     •   Index size reduced by 27%
     •   Number of disk page accesses reduced by 50%
     •   Scalability achieved allows to consider new challenges in data
         mining problems which have been untenable otherwise.
     •   First approach that is experimentally validated to scale Time series
         data collections with up to 1 Billion objects.

                                                                    INTRODUCTION | 6
Preliminaries : The SAX representation
     1. Represent a time series T of length n in w-dimensional space using PAA
             Where the ith element is:
                                                                        ni


                                                                      T
                                                                        w

                                                       ti  w
                                                            n                      j
                                                                 j   n ( i 1) 1
                                                                      w

     2. Then discretize into a vector of symbols
          Breakpoints map to a small alphabet a of symbols
          For every segment a bit wise representation is assigned to code that region
                                                                                   3
 3                                      3

                                                                                   2   iSAX(T,4,4)
 2     A time series T                  2   PAA(T,4)
                                                                                                                         00
                                                                                   1
 1                                      1
                                                                                                                         01
                                                                                   0
 0                                      0                                                                                10

-1                                     -1
                                                                                  -1                                     11

-2                                                                                -2
                                       -2



-3                                                                                -3
                                       -3
       0        4        8   12   16        0      4   8        12           16        0        4    8       12     16




     i) A time series T, of length 16. ii) A PAA approximation of T, with 4 segments.
     iii) A time series T converted into SAX words of cardinality 4


                                                                                                         PRELIMINARIES | 7
Preliminaries : The SAX representation




                                         PRELIMINARIES | 8
Preliminaries : The SAX representation

More About SAX Representation

   1. The SAX Representation supports arbitrary breakpoints.
   2. A SAX word is simply a vector of discrete numbers.
        e.g. SAX word can be written as {3, 3, 1, 0} or in binary as {11, 11, 01, 00}.
   3. Denoted by T4, produced by function SAX(T,4,4).
   4. SAX(T, w, c) = TC where w = word length, c = cardinality of symbols
                      SAX(T,4,2) = T2 = {1, 1, 0, 0}
                      SAX(T,4,4) = T2 = {11, 11, 01, 00}
   5. Converting to a reduced cardinality SAX word by ignoring Trailing Bits




                                                                        PRELIMINARIES | 9
Preliminaries : iSAX representation
   1. Tedious to write binary string in SAX Representation
   2. To include cardinality in iSAX representation we add cardinality as subscript
                      iSAX(T,4,8) = T8 = {68,68,38,08}
   3. Key property: Ability to compare two iSAX words of different cardinalities.



   4. Promote the lower cardinality representation into the cardinality of the larger
            Promote S2 as

   5. Obtain missing bit values by returning the one closest in SAX space to the
   corresponding value in T8. Not exact but admissible value of word S.


   6. We can compare iSAX words where each word has mixed cardinality
                        {111, 11, 101, 0} = {78,34,58,02}


                                                                       PRELIMINARIES | 10
Preliminaries : Indexing iSAX
   1. Using the iSAX representation, we can produce a set of bw different
      mutually exclusive iSAX words. (b = cardinality, w = word length)

   2. Words can be represented by files
       e.g word {68,68,38,08} can be mapped to 6.8_6.8_3.8_0.8.txt

   3. Threshold th defines maximum time series a word can hold.

   4. For overflow, split the file. To split choose one iSAX symbol, examine an
      additional bit, and use its value to create two new files.

                              {24,34,34,24}
                              (Original file)


            {48,34,34,24}                       {58,34,34,24}
            (child file 1)                       (child file 2)

                                                                     PRELIMINARIES | 11
Preliminaries : Indexing iSAX
   1. File splitting produces hierarchical, but unbalanced, index structure that
      contains non-overlapping regions

   2. Three class of nodes:

       1. Root Node: Representative of the complete iSAX space. Contains no
          SAX representation, but only pointers to the children nodes.
       2. Leaf Node: Contains a pointer to an index file on disk with the raw time
          series entries. Stores the highest cardinality iSAX word for each time
          series.
       3. Internal Node: Created when the number of time series contained by a
          leaf node exceeds threshold th.

   3. iSAX employs binary splits along a single dimension




                                                                   PRELIMINARIES | 12
THE iSAX 2.0 INDEX
  Disadvantages of iSAX
  1. Take too long to build index for large datasets.
      Indexing a dataset with 500 M time series takes 20 days to complete

  2. Because:
      • Naïve node splitting policy leading to ineffective splits and additional
         disk I/O
      • No bulk loading strategy, does not use available main memory to
         reduce disk I/O

  3. To overcome these challenges:
      • New algorithm for Time series bulk loading that reduces disk I/O
      • New Node splitting policy resulting in more compact index, and hence
         further reducing the I/O cost.

  4.   We will refer to this improved iSAX index as iSAX 2.0.


                                                                    iSAX 2.0 INDEX | 13
THE iSAX 2.0 INDEX
  Bulk Loading
  Design Principles
      • Minimize the number of disk I/O operations
      • Maximize sequential disk access
      • Take advantage of available Main memory

  Approach for Proposed solution
      • Instead of building entire index at once, build distinct sub trees of index
        one at a time

  Assumptions
      • Limited main memory i.e. less than necessary to fit entire dataset and
        index




                                                                     iSAX 2.0 INDEX | 14
THE iSAX 2.0 INDEX
  Bulk Loading: Algorithm basics
  Uses two main memory buffer layers

      First Buffer Layer (FBL) :
      • Corresponds to first level of iSAX 2.0 nodes
      • Cluster together the Time series that will end up in same iSAX 2.0 sub
         tree rooted in one of the direct children of the root.
      • No size restriction, grow till they occupy available main memory.

      Last Buffer Layer (LBL) :
      • Corresponds to leaf nodes
      • Gather all Time series of leaf nodes and flush them to disk
      • Same size as of leaf nodes on disk




                                                                  iSAX 2.0 INDEX | 15
THE iSAX 2.0 INDEX
  Bulk Loading: Algorithm Description
  Algorithm operates in two phases which alternate until entire dataset is indexed

      Phase 1:
      • Time series is inserted in the corresponding FBL buffer until the main
        memory is almost full.
      • At the end of Phase 1, we have time series collected in the FBL buffers.
      • Corresponding (leaf) nodes L1, L2, L3, of the index are not yet created.




                                                                    iSAX 2.0 INDEX | 16
THE iSAX 2.0 INDEX
  Bulk Loading: Algorithm Description
     Phase 2:
     • Time series contained in each FBL buffer is moved to the appropriate
       LBL buffers.

     • Sequentially for each FBL buffer, the algorithm reads the time series and
       creates the entire sub tree (with Internal and leaf iSAX 2.0 nodes) rooted
       at the node corresponding to that FBL buffer.

     • By emptying the right-most FBL buffer, we create the sub tree rooted at
       internal node I1. The algorithm also creates for each leaf node a
       corresponding LBL buffer

     • After all Time series of a FBL buffer moves to corresponding LBL buffers,
       these LBL are flushed to disk making memory available for use.



                                                                 iSAX 2.0 INDEX | 17
THE iSAX 2.0 INDEX
  Bulk Loading: Algorithm Description
     Phase 2:
     • At the end of Phase 2, all the time series from the FBL buffers have
       moved down the tree to the appropriate leaf nodes and LBL buffers, and
       then from the LBL buffers to the disk.
     • Now both FBL and LBL are empty, and we are ready for next iteration of
       our algorithm going back to Phase 1.
     • This process continues until the entire dataset has been indexed.




                                                              iSAX 2.0 INDEX | 18
THE iSAX 2.0 INDEX
  Node Splitting Policy
  Design Principles
      • Keep the index as small as can
      • Avoid poor utilization of leaf nodes thus reducing the length of index
      • Make splits that distribute Time series equally to children nodes

  Approach for Proposed solution
      • Examine for each segment the distributions of the highest cardinality
        symbols across the relevant time series.
      • Split the segment for which the highest cardinality iSAX symbols lie on
        both sides of the breakpoint

  Solution
      • Split segment for which breakpoint is within range μ ± 3σ and closest to μ
         where σ is Standard deviation and μ is mean for the Symbol



                                                                    iSAX 2.0 INDEX | 19
THE iSAX 2.0 INDEX
  Node Splitting Policy
  Example
     • iSAX word of length (segments) four, split a node with cardinality 2 (for
        all segments). Compute μ ± 3σ value for each segment

      • Segment 1 lies entirely below the lowest breakpoint of cardinality 4 (i.e.,
        the cardinality of the two new nodes after the split) ∴ No Split

      • Segments 2 and 3 ranges cross some breakpoint of cardinality 4.
        Split on segment 3, because its μ value lies closer to a breakpoint than
        that of segment 2.

      • This tells some of the time series in the node to be split will end up in
        the new node representing the area above the breakpoint, while the rest
        will move to the second new node thus, achieving a balanced split.



                                                                    iSAX 2.0 INDEX | 20
THE iSAX 2.0 INDEX
  Node Splitting Policy




                          iSAX 2.0 INDEX | 21
THE iSAX 2.0 INDEX
  Psuedo Code

  Bulk Loading Algorithm

  Node splitting Algorithm




                             iSAX 2.0 INDEX | 22
Experimental Evaluation
Experimental Framework

• Intel Xeon E5504 , 24 GB Memory, 2TB disk, Windows Vista SP2
• AMD Athlon 64 X2, 3 GB Memory, 400 GB disk, Windows XP SP2
• Implemented in C#/.Net 3.5

Other Algorithms used for comparison
• iSAX 2.0 compared with iSAX and iSAX Buffer Tree

Datasets
•   Up to 1 Billion random time series of size 256 (~ 2TB)
•   20 Million subsequences of length 176
•   22 Million DNA subsequences of size 640
•   70 Million web image color histograms of size 256 (~133 GB)




                                                       EXPERIMENTAL EVALUATION | 23
Experimental Evaluation
Splitting Policy




• On an average 34% less nodes than iSAX index
• On an average 30% less time to build an index
• Leaf nodes with an average of over 54% more occupancy



                                                  EXPERIMENTAL EVALUATION | 24
Experimental Evaluation
Bulk Loading Evaluation

• Index built time for 1 Billion time
  series reduced to 72%
  From 2 months to 16 days

• New strategy leads to 35% less
  disk usage

• More than 99.5% sequential disk
  page accesses




                                        EXPERIMENTAL EVALUATION | 25
Experimental Evaluation
Case Study in Entomology

• Early detection of harmful insect behavior
• 20 Million Time series of size 176
• 6 Hours to index (~ 27 GB disk space), 0.5 second to answer 1-NN query

Mining Massive DNA Sequences
• Investigating genetic relationships
• 22 Million Time series of size 640
• 9 Hours to index (~ 115 GB disk space), 0.6 second to answer 10-NN query

Mining Massive Image collections
• Finding similar images
• 70 Million web image color histograms of size 256 (~133 GB)
• 12 Hours to index (~ 133 GB disk space), 0.9 second to answer 1-NN query



                                                    EXPERIMENTAL EVALUATION | 26
Experimental Evaluation
Case Study in Entomology




Mining Massive DNA Sequences




Mining Massive Image collections




                                   EXPERIMENTAL EVALUATION | 27
Conclusions and future work
Conclusions

Proposed iSAX 2.0
    • indexing for billion size time series collections

Experimentally validated proposed approach
    • First published experiment with analysis over 1 Billion time series

Case studies in diverse domains shows usefulness of this approach
   • Analysis of vast collection of data in fields like Etymology, web
       mining, DNA sequencing etc.

Future Work
Index and mine larger time series of order 10 Billion or more
    • e.g. Functional magnetic resonance imaging (fMRI)
    • Single experiment (1 subject, 1 test) produces 60,000 TS of length
       3000 leading to 12GB


                                                          CONCLUSIONS & FUTURE WORK | 28
REFERENCES
• iSAX 2.0 Indexing and Mining Billion time Series : Alessandro Camerra, Themis
  Palpanas, Jin Shieh and Eamonn Keogh (2010) ICDM 2010 [pdf]

• Microsoft Research iSAX 2.0: Indexing and Mining One Billion Time Series
  MSR URL

• Images Source 1, Source 2




                                                                    REFERENCES | 29
THANK YOU

   Q/A




            REFERENCES | 29

More Related Content

PDF
quantization
PPTX
Managing Large Scale Financial Time-Series Data with Graphs
PDF
Wavelet transform and its applications in data analysis and signal and image ...
PDF
Image similarity using symbolic representation and its variations
PDF
Chapter 1 Functions Relations V3
PPTX
The FFT And Spectral Analysis
PDF
Ray Richardson, Chief Technology Officer at Simularity at MLconf SEA - 5/01/15
PPT
Ch1 2
quantization
Managing Large Scale Financial Time-Series Data with Graphs
Wavelet transform and its applications in data analysis and signal and image ...
Image similarity using symbolic representation and its variations
Chapter 1 Functions Relations V3
The FFT And Spectral Analysis
Ray Richardson, Chief Technology Officer at Simularity at MLconf SEA - 5/01/15
Ch1 2

Similar to Indexing and Mining a Billion Time series using iSAX 2.0 (20)

PDF
Streamy, Pipy, Analyticy
PDF
PDF
Statistics lecture 13 (chapter 13)
PDF
Distinguishing the signal from noise in an SVD of simulation data
PDF
Sketch sort ochadai20101015-public
PDF
Functional Programming In Mathematica
PDF
GEOG3839.9: Climate from trees
PPT
Basic Theory (FE)
PDF
Signal processingcolumbia
PPT
Introduction to matlab
PDF
Elementary Landscape Decomposition of Combinatorial Optimization Problems
PDF
What the matrix can tell us about the social network.
PDF
Linear Equations And Graphing
PDF
Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams
KEY
Verification with LoLA: 7 Implementation
PDF
Polish nootation
PDF
Lecture.1
Streamy, Pipy, Analyticy
Statistics lecture 13 (chapter 13)
Distinguishing the signal from noise in an SVD of simulation data
Sketch sort ochadai20101015-public
Functional Programming In Mathematica
GEOG3839.9: Climate from trees
Basic Theory (FE)
Signal processingcolumbia
Introduction to matlab
Elementary Landscape Decomposition of Combinatorial Optimization Problems
What the matrix can tell us about the social network.
Linear Equations And Graphing
Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams
Verification with LoLA: 7 Implementation
Polish nootation
Lecture.1
Ad

More from Vasu Jain (14)

PPTX
Tech jobs beyond programming - Game design
PPTX
Conversational AI & Smart Homes
PDF
Identifying and solving enterprise problems
PDF
Building Enterprise Chat Bots
PDF
Chat Bots and how to build a Slack bot
PPTX
Sentiment analysis of tweets [SCSE, 13]
PPTX
Sentiment analysis of tweets
PPT
How google is using linked data today and vision for tomorrow
PPTX
Internet Explorer 9
PPTX
Cloud Computing Architecture
PPTX
Power Point 2010
PPTX
Microsoft Office 2010 Overview
PPTX
Windows 7 What's Next
PPTX
Windows Phone 7
Tech jobs beyond programming - Game design
Conversational AI & Smart Homes
Identifying and solving enterprise problems
Building Enterprise Chat Bots
Chat Bots and how to build a Slack bot
Sentiment analysis of tweets [SCSE, 13]
Sentiment analysis of tweets
How google is using linked data today and vision for tomorrow
Internet Explorer 9
Cloud Computing Architecture
Power Point 2010
Microsoft Office 2010 Overview
Windows 7 What's Next
Windows Phone 7
Ad

Recently uploaded (20)

PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PPTX
Spectroscopy.pptx food analysis technology
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PPTX
sap open course for s4hana steps from ECC to s4
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PDF
Machine learning based COVID-19 study performance prediction
PDF
cuic standard and advanced reporting.pdf
PDF
KodekX | Application Modernization Development
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
Spectral efficient network and resource selection model in 5G networks
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
MIND Revenue Release Quarter 2 2025 Press Release
Spectroscopy.pptx food analysis technology
Per capita expenditure prediction using model stacking based on satellite ima...
sap open course for s4hana steps from ECC to s4
20250228 LYD VKU AI Blended-Learning.pptx
Understanding_Digital_Forensics_Presentation.pptx
Machine learning based COVID-19 study performance prediction
cuic standard and advanced reporting.pdf
KodekX | Application Modernization Development
Diabetes mellitus diagnosis method based random forest with bat algorithm
MYSQL Presentation for SQL database connectivity
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
Agricultural_Statistics_at_a_Glance_2022_0.pdf
NewMind AI Weekly Chronicles - August'25 Week I
Advanced methodologies resolving dimensionality complications for autism neur...
Spectral efficient network and resource selection model in 5G networks
Digital-Transformation-Roadmap-for-Companies.pptx

Indexing and Mining a Billion Time series using iSAX 2.0

  • 1. iSAX 2.0 INDEXING & MINING ONE BILLION TIME SERIES Paper By: Alessandro Camerra Themis Palpanas Jin Shieh Eamonn Keogh
  • 2. iSAX 2.0 INDEXING & MINING ONE BILLION TIME SERIES Presented by: Vasu Jain
  • 3. Contents 1. Introduction 2. Preliminaries a. The SAX Representation b. The iSAX Representation c. Indexing iSAX 3. iSAX 2 Index a. Bulk Loading b. Node Splitting Policy 4. Experimental Evaluations 5. Conclusion & Future work 6. References CONTENTS | 2
  • 4. Time Series Sequence of data points measured at successive time instants spaced at uniform time intervals. Time Series T = t1, t2, t3, t4 …… tn n real valued observations at equal intervals of time e.g. Stock Prices, Acid precipitation data at a single location INTRODUCTION | 3
  • 5. Time Series Examples Historical archives Sensors DNA sequences Motion data Stocks ECG data INTRODUCTION | 4
  • 6. Introduction • Indexing and Mining time series is hot. There is a pressing need for indexing and mining Time series data • Time series of order 100 of Millions to Billions “…we have about a million samples per minute coming in from 1000 gas turbines around the world… we need to be able to do similarity search for...” Lane Desborough, GE. “…an archival rate of 3.6 billion points a day, how can we (do similarity search) in this data?” Josh Patterson, TVA. • Two bottlenecks while mining these massive Time Series • Time complexity of building the index (6 days to index 100 million objects) • Time to retrieve data from the disk INTRODUCTION | 5
  • 7. Introduction • iSAX 2.0 • To solve these problem • A data structure which is an extension to iSAX • Suitable for indexing and mining very large time series. • Novel mechanism for scalable indexing of time series: Bulk loading algorithm, node splitting policies. • Results: • Index building time reduced by 72% with bulk loading scheme • Index size reduced by 27% • Number of disk page accesses reduced by 50% • Scalability achieved allows to consider new challenges in data mining problems which have been untenable otherwise. • First approach that is experimentally validated to scale Time series data collections with up to 1 Billion objects. INTRODUCTION | 6
  • 8. Preliminaries : The SAX representation 1. Represent a time series T of length n in w-dimensional space using PAA Where the ith element is: ni T w ti  w n j j n ( i 1) 1 w 2. Then discretize into a vector of symbols Breakpoints map to a small alphabet a of symbols For every segment a bit wise representation is assigned to code that region 3 3 3 2 iSAX(T,4,4) 2 A time series T 2 PAA(T,4) 00 1 1 1 01 0 0 0 10 -1 -1 -1 11 -2 -2 -2 -3 -3 -3 0 4 8 12 16 0 4 8 12 16 0 4 8 12 16 i) A time series T, of length 16. ii) A PAA approximation of T, with 4 segments. iii) A time series T converted into SAX words of cardinality 4 PRELIMINARIES | 7
  • 9. Preliminaries : The SAX representation PRELIMINARIES | 8
  • 10. Preliminaries : The SAX representation More About SAX Representation 1. The SAX Representation supports arbitrary breakpoints. 2. A SAX word is simply a vector of discrete numbers. e.g. SAX word can be written as {3, 3, 1, 0} or in binary as {11, 11, 01, 00}. 3. Denoted by T4, produced by function SAX(T,4,4). 4. SAX(T, w, c) = TC where w = word length, c = cardinality of symbols SAX(T,4,2) = T2 = {1, 1, 0, 0} SAX(T,4,4) = T2 = {11, 11, 01, 00} 5. Converting to a reduced cardinality SAX word by ignoring Trailing Bits PRELIMINARIES | 9
  • 11. Preliminaries : iSAX representation 1. Tedious to write binary string in SAX Representation 2. To include cardinality in iSAX representation we add cardinality as subscript iSAX(T,4,8) = T8 = {68,68,38,08} 3. Key property: Ability to compare two iSAX words of different cardinalities. 4. Promote the lower cardinality representation into the cardinality of the larger Promote S2 as 5. Obtain missing bit values by returning the one closest in SAX space to the corresponding value in T8. Not exact but admissible value of word S. 6. We can compare iSAX words where each word has mixed cardinality {111, 11, 101, 0} = {78,34,58,02} PRELIMINARIES | 10
  • 12. Preliminaries : Indexing iSAX 1. Using the iSAX representation, we can produce a set of bw different mutually exclusive iSAX words. (b = cardinality, w = word length) 2. Words can be represented by files e.g word {68,68,38,08} can be mapped to 6.8_6.8_3.8_0.8.txt 3. Threshold th defines maximum time series a word can hold. 4. For overflow, split the file. To split choose one iSAX symbol, examine an additional bit, and use its value to create two new files. {24,34,34,24} (Original file) {48,34,34,24} {58,34,34,24} (child file 1) (child file 2) PRELIMINARIES | 11
  • 13. Preliminaries : Indexing iSAX 1. File splitting produces hierarchical, but unbalanced, index structure that contains non-overlapping regions 2. Three class of nodes: 1. Root Node: Representative of the complete iSAX space. Contains no SAX representation, but only pointers to the children nodes. 2. Leaf Node: Contains a pointer to an index file on disk with the raw time series entries. Stores the highest cardinality iSAX word for each time series. 3. Internal Node: Created when the number of time series contained by a leaf node exceeds threshold th. 3. iSAX employs binary splits along a single dimension PRELIMINARIES | 12
  • 14. THE iSAX 2.0 INDEX Disadvantages of iSAX 1. Take too long to build index for large datasets. Indexing a dataset with 500 M time series takes 20 days to complete 2. Because: • Naïve node splitting policy leading to ineffective splits and additional disk I/O • No bulk loading strategy, does not use available main memory to reduce disk I/O 3. To overcome these challenges: • New algorithm for Time series bulk loading that reduces disk I/O • New Node splitting policy resulting in more compact index, and hence further reducing the I/O cost. 4. We will refer to this improved iSAX index as iSAX 2.0. iSAX 2.0 INDEX | 13
  • 15. THE iSAX 2.0 INDEX Bulk Loading Design Principles • Minimize the number of disk I/O operations • Maximize sequential disk access • Take advantage of available Main memory Approach for Proposed solution • Instead of building entire index at once, build distinct sub trees of index one at a time Assumptions • Limited main memory i.e. less than necessary to fit entire dataset and index iSAX 2.0 INDEX | 14
  • 16. THE iSAX 2.0 INDEX Bulk Loading: Algorithm basics Uses two main memory buffer layers First Buffer Layer (FBL) : • Corresponds to first level of iSAX 2.0 nodes • Cluster together the Time series that will end up in same iSAX 2.0 sub tree rooted in one of the direct children of the root. • No size restriction, grow till they occupy available main memory. Last Buffer Layer (LBL) : • Corresponds to leaf nodes • Gather all Time series of leaf nodes and flush them to disk • Same size as of leaf nodes on disk iSAX 2.0 INDEX | 15
  • 17. THE iSAX 2.0 INDEX Bulk Loading: Algorithm Description Algorithm operates in two phases which alternate until entire dataset is indexed Phase 1: • Time series is inserted in the corresponding FBL buffer until the main memory is almost full. • At the end of Phase 1, we have time series collected in the FBL buffers. • Corresponding (leaf) nodes L1, L2, L3, of the index are not yet created. iSAX 2.0 INDEX | 16
  • 18. THE iSAX 2.0 INDEX Bulk Loading: Algorithm Description Phase 2: • Time series contained in each FBL buffer is moved to the appropriate LBL buffers. • Sequentially for each FBL buffer, the algorithm reads the time series and creates the entire sub tree (with Internal and leaf iSAX 2.0 nodes) rooted at the node corresponding to that FBL buffer. • By emptying the right-most FBL buffer, we create the sub tree rooted at internal node I1. The algorithm also creates for each leaf node a corresponding LBL buffer • After all Time series of a FBL buffer moves to corresponding LBL buffers, these LBL are flushed to disk making memory available for use. iSAX 2.0 INDEX | 17
  • 19. THE iSAX 2.0 INDEX Bulk Loading: Algorithm Description Phase 2: • At the end of Phase 2, all the time series from the FBL buffers have moved down the tree to the appropriate leaf nodes and LBL buffers, and then from the LBL buffers to the disk. • Now both FBL and LBL are empty, and we are ready for next iteration of our algorithm going back to Phase 1. • This process continues until the entire dataset has been indexed. iSAX 2.0 INDEX | 18
  • 20. THE iSAX 2.0 INDEX Node Splitting Policy Design Principles • Keep the index as small as can • Avoid poor utilization of leaf nodes thus reducing the length of index • Make splits that distribute Time series equally to children nodes Approach for Proposed solution • Examine for each segment the distributions of the highest cardinality symbols across the relevant time series. • Split the segment for which the highest cardinality iSAX symbols lie on both sides of the breakpoint Solution • Split segment for which breakpoint is within range μ ± 3σ and closest to μ where σ is Standard deviation and μ is mean for the Symbol iSAX 2.0 INDEX | 19
  • 21. THE iSAX 2.0 INDEX Node Splitting Policy Example • iSAX word of length (segments) four, split a node with cardinality 2 (for all segments). Compute μ ± 3σ value for each segment • Segment 1 lies entirely below the lowest breakpoint of cardinality 4 (i.e., the cardinality of the two new nodes after the split) ∴ No Split • Segments 2 and 3 ranges cross some breakpoint of cardinality 4. Split on segment 3, because its μ value lies closer to a breakpoint than that of segment 2. • This tells some of the time series in the node to be split will end up in the new node representing the area above the breakpoint, while the rest will move to the second new node thus, achieving a balanced split. iSAX 2.0 INDEX | 20
  • 22. THE iSAX 2.0 INDEX Node Splitting Policy iSAX 2.0 INDEX | 21
  • 23. THE iSAX 2.0 INDEX Psuedo Code Bulk Loading Algorithm Node splitting Algorithm iSAX 2.0 INDEX | 22
  • 24. Experimental Evaluation Experimental Framework • Intel Xeon E5504 , 24 GB Memory, 2TB disk, Windows Vista SP2 • AMD Athlon 64 X2, 3 GB Memory, 400 GB disk, Windows XP SP2 • Implemented in C#/.Net 3.5 Other Algorithms used for comparison • iSAX 2.0 compared with iSAX and iSAX Buffer Tree Datasets • Up to 1 Billion random time series of size 256 (~ 2TB) • 20 Million subsequences of length 176 • 22 Million DNA subsequences of size 640 • 70 Million web image color histograms of size 256 (~133 GB) EXPERIMENTAL EVALUATION | 23
  • 25. Experimental Evaluation Splitting Policy • On an average 34% less nodes than iSAX index • On an average 30% less time to build an index • Leaf nodes with an average of over 54% more occupancy EXPERIMENTAL EVALUATION | 24
  • 26. Experimental Evaluation Bulk Loading Evaluation • Index built time for 1 Billion time series reduced to 72% From 2 months to 16 days • New strategy leads to 35% less disk usage • More than 99.5% sequential disk page accesses EXPERIMENTAL EVALUATION | 25
  • 27. Experimental Evaluation Case Study in Entomology • Early detection of harmful insect behavior • 20 Million Time series of size 176 • 6 Hours to index (~ 27 GB disk space), 0.5 second to answer 1-NN query Mining Massive DNA Sequences • Investigating genetic relationships • 22 Million Time series of size 640 • 9 Hours to index (~ 115 GB disk space), 0.6 second to answer 10-NN query Mining Massive Image collections • Finding similar images • 70 Million web image color histograms of size 256 (~133 GB) • 12 Hours to index (~ 133 GB disk space), 0.9 second to answer 1-NN query EXPERIMENTAL EVALUATION | 26
  • 28. Experimental Evaluation Case Study in Entomology Mining Massive DNA Sequences Mining Massive Image collections EXPERIMENTAL EVALUATION | 27
  • 29. Conclusions and future work Conclusions Proposed iSAX 2.0 • indexing for billion size time series collections Experimentally validated proposed approach • First published experiment with analysis over 1 Billion time series Case studies in diverse domains shows usefulness of this approach • Analysis of vast collection of data in fields like Etymology, web mining, DNA sequencing etc. Future Work Index and mine larger time series of order 10 Billion or more • e.g. Functional magnetic resonance imaging (fMRI) • Single experiment (1 subject, 1 test) produces 60,000 TS of length 3000 leading to 12GB CONCLUSIONS & FUTURE WORK | 28
  • 30. REFERENCES • iSAX 2.0 Indexing and Mining Billion time Series : Alessandro Camerra, Themis Palpanas, Jin Shieh and Eamonn Keogh (2010) ICDM 2010 [pdf] • Microsoft Research iSAX 2.0: Indexing and Mining One Billion Time Series MSR URL • Images Source 1, Source 2 REFERENCES | 29
  • 31. THANK YOU Q/A REFERENCES | 29