Spark-ITS: Indexing for Large-Scale Time Series Data on Spark with Liang Zhang

Liang Zhang (lzhang6@wpi.edu)
Data Science Dept.,
Worcester Polytechnic Institute
#SAISEco5

Prof. Elke A. Rundensteiner
Prof. Mohamed Y. Eltabakh
Liang Zhang
Noura Alghamdi
Data Science Research Group @ Worcester Polytechnic Institute
Liang Zhang, Noura Alghamdi, Mohamed Y. Eltabakh, Elke A. Rundensteiner. "TARDIS: Distributed Indexing Framework for Big Time Series Data." Proceedings of the 35th IEEE International Conference on Data Engineering (ICDE), 2019.

Outline
• Motivation
• Background
• Spark-ITS Framework
– Overview
– Index Construction
– Query Processing
• Performance Evaluation

Time Series are Continuously Produced Everywhere
• Examples: climate data, web logs, stock prices, EEG
• How to deal with billions of time series?

Almost all Time Series Data Mining Tasks rely on Similarity Query
• Tasks: Clustering, Motif Discovery, Classification, Outlier Detection
• Similarity query types: Whole Matching, Subsequence Matching
Esling, Philippe, and Carlos Agon. "Time-series data mining." ACM Computing Surveys (CSUR) 45.1 (2012): 12.

Spark-ITS
• A new index tree and an effective signature that simplify cardinality conversion and better preserve similarity
• A distributed index framework that supports large-scale time series datasets
• Efficient algorithms for processing Exact Match and kNN Approximate queries

Spark-ITS Overview
[Figure: architecture diagram with the components Query, Global Index, Partition, Local Index, and Indexed Data.]
Global Index
1. Sampling
2. Node Statistic
3. Build Index Tree
4. Assign Partition ID
Partition
1. Read and convert data
2. Shuffle data
Local Index
1. Construct Local Structure
2. Construct Bloom Filter

Background: iSAX Representation
PAA: Piecewise Aggregate Approximation
iSAX: indexable Symbolic Aggregate approXimation
[Figure: four panels showing the same time series.]
• A time series of length 16
• PAA representation with 4 segments
• SAX representation with 4 segments and cardinality 4: [11, 10, 01, 00]
• iSAX representation with 4 segments and variable cardinality: [1_2, 1_2, 01_4, 0_2] (subscripts give each symbol's cardinality)
Shieh, Jin, and Eamonn Keogh. "iSAX: indexing and mining terabyte sized time series." SIGKDD, ACM, 2008.
Camerra, A., Palpanas, T., Shieh, J., & Keogh, E. "iSAX 2.0: Indexing and mining one billion time series." ICDM, 2010.
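The PAA, SAX, and iSAX steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the input series is made up, the breakpoints are the usual quartiles of a standard normal distribution, and the function names are hypothetical.

```python
from bisect import bisect_right

# Breakpoints splitting a standard normal distribution into 4 equiprobable regions.
BREAKPOINTS_4 = [-0.6745, 0.0, 0.6745]

def paa(series, segments):
    """Average the series over equal-length segments (length must divide evenly)."""
    size = len(series) // segments
    return [sum(series[i * size:(i + 1) * size]) / size for i in range(segments)]

def sax(series, segments, breakpoints=BREAKPOINTS_4):
    """Map each PAA mean to a bit-string symbol; larger values get larger symbols."""
    bits = len(breakpoints).bit_length()  # 3 breakpoints -> 2 bits per symbol
    return [format(bisect_right(breakpoints, mean), "0{}b".format(bits))
            for mean in paa(series, segments)]

def reduce_cardinality(word, bits_per_symbol):
    """iSAX: keep only the leading bits of each symbol (variable cardinality)."""
    return [symbol[:bits] for symbol, bits in zip(word, bits_per_symbol)]

# Illustrative (made-up) normalized series of length 16.
series = [1.2, 1.0, 0.9, 1.1, 0.3, 0.4, 0.2, 0.5,
          -0.2, -0.3, -0.1, -0.4, -1.0, -1.2, -0.9, -1.1]
word = sax(series, 4)
print(word)                                     # ['11', '10', '01', '00']
print(reduce_cardinality(word, [1, 1, 2, 1]))   # ['1', '1', '01', '0']
```

Lowering the cardinality of a symbol is just dropping its trailing bits, which is why an iSAX word can mix cardinalities across segments.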

Word-level Similarity
[Figure: two panels, each plotting time series A, B, and C (time axis 1 to 11, value axis -2 to 2) against the 3-bit regions 000 to 111.]
State-of-the-art: Character-level Similarity (B and C are similar)
A: [0_1, 0_1, 011_3, 1_1]
B: [0_1, 0_1, 010_3, 1_1]
C: [0_1, 0_1, 010_3, 1_1]
Proposed: Word-level Similarity (A and C are similar)
A: [01_2, 01_2, 01_2, 10_2]
B: [00_2, 00_2, 01_2, 11_2]
C: [01_2, 01_2, 01_2, 10_2]
(On this slide, subscripts give the number of bits per symbol.)

New Index Tree Supports Word-level Similarity
[Figure: two trees of internal and leaf nodes, with levels labeled 1st bit, 2nd bit, 3rd bit.]
State-of-the-art: iSAX Binary Tree
Root
Depth 1: [0_2, 1_2, 0_2], [1_2, 1_2, 1_2], [0_2, 0_2, 0_2]
Depth 2: [0_2, 11_4, 0_2], [0_2, 10_4, 0_2]
Depth 3: [0_2, 11_4, 01_4], [0_2, 11_4, 00_4]
. . .
Proposed: iSAX-T K-ary Tree
Root
Level 1: [1_2, 1_2, 0_2], [1_2, 1_2, 1_2], [0_2, 0_2, 0_2], [0_2, 0_2, 1_2], . . .
Level 2: [01_4, 00_4, 10_4], [01_4, 01_4, 11_4], [00_4, 00_4, 10_4], . . .
Level 3: [010_8, 000_8, 100_8], [011_8, 001_8, 101_8], . . .

iSAX-T (Transpose) Signature
Time series: [1100, 1101, 0110, 0001]
Transpose the bit matrix (one row per segment), then read each transposed row as one hex digit:
  1 1 0 0        1 1 0 0  =  C
  1 1 0 1   →    1 1 1 0  =  E
  0 1 1 0        0 0 1 0  =  2
  0 0 0 1        0 1 0 1  =  5
Signatures at lower cardinality are prefixes of the full signature:
SAX(T,4,16) = {1100, 1101, 0110, 0001} = CE25
SAX(T,4,8) = {110, 110, 011, 000} = CE2
SAX(T,4,4) = {11, 11, 01, 00} = CE
SAX(T,4,2) = {1, 1, 0, 0} = C
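The transpose-and-hex step can be checked with a short Python sketch. The function name `isax_t` is hypothetical, and the sketch assumes every symbol has the same number of bits and that the segment count is a multiple of 4, so each transposed row maps to whole hex digits.

```python
def isax_t(word):
    """Transpose the bit matrix of a SAX word and hex-encode each transposed row."""
    bits = len(word[0])
    assert all(len(symbol) == bits for symbol in word), "symbols need equal cardinality"
    assert len(word) % 4 == 0, "sketch assumes the segment count is a multiple of 4"
    hex_width = len(word) // 4  # 4 segments -> 1 hex digit per bit level
    rows = ("".join(symbol[level] for symbol in word) for level in range(bits))
    return "".join(format(int(row, 2), "0{}X".format(hex_width)) for row in rows)

print(isax_t(["1100", "1101", "0110", "0001"]))  # CE25
print(isax_t(["110", "110", "011", "000"]))      # CE2
```

Because each bit level becomes its own group of hex digits, converting to a lower cardinality is a string truncation rather than a per-segment recomputation.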

Global Index [1/4]: Sampling
[Figure: blocks of raw time series values are sampled, converted to iSAX-T signatures, and counted.]
Pipeline: Time series → Sampling → Map → Reduce (a word-counting MapReduce process)
Map output, (iSAX-T(b bits), Freq: 1):
1256ae3e , 1
0134ef45 , 1
234567ae , 1
1256ae3e , 1
1256ae3e , 1
234567ae , 1
237867ae , 1
024567ae , 1
6243371e , 1
……
452167ef , 1
Reduce output, (iSAX-T(b bits), Freq(b bits)):
1256ae3e , 23
0134ef45 , 2
234567ae , 20
024567ae , 4
237867ae , 10
……
452167ef , 10
Segment number: 8, so 2 hex letters represent 1 bit level
Initial cardinality: b-bit level
Data sizes shown along the pipeline, which reads from HDFS: 1 Terabyte, 100 G, 0.9 G, 0.1 G
The data sizes are based on 1 billion time series of length 256
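The word-counting step on the Sampling slide can be sketched in plain Python. This is an illustration under stated assumptions, not the Spark job itself: the sample is assumed to be already drawn and converted to iSAX-T signatures, and `map_phase` and `reduce_phase` are hypothetical names standing in for a Spark map followed by a reduce-by-key.

```python
from collections import Counter

def map_phase(signatures):
    """Emit one (signature, 1) pair per sampled time series."""
    return [(signature, 1) for signature in signatures]

def reduce_phase(pairs):
    """Sum the counts of identical signatures."""
    freq = Counter()
    for signature, count in pairs:
        freq[signature] += count
    return dict(freq)

# The first signatures listed on the slide, used here as a tiny sample.
sample = ["1256ae3e", "0134ef45", "234567ae", "1256ae3e", "1256ae3e",
          "234567ae", "237867ae", "024567ae", "6243371e"]
print(reduce_phase(map_phase(sample)))
```

Only the distinct signatures and their counts leave this step, which is consistent with the shrinking data sizes annotated on the slide.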

Global Index [2/4]: Node Statistic
Each layer repeats the same steps: Map, Reduce, Judge, Filter.
1st layer: (iSAX-T(b), Freq(b)) → Map → (iSAX-T(1), Freq(b)) → Reduce → [(iSAX-T(1), Freq(1))] → Judge on max(Freq(1)) → Filter
2nd layer: (iSAX-T(b), Freq(b)) → Map → (iSAX-T(2), Freq(b)) → Reduce → [(iSAX-T(2), Freq(2))] → Judge on max(Freq(2)) → Filter
3rd layer: (iSAX-T(b), Freq(b)) → Map → (iSAX-T(3), Freq(b)) → Reduce → [(iSAX-T(3), Freq(3))] → Judge on max(Freq(3))
……
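A layer-by-layer sketch of the Map, Reduce, Judge, Filter loop follows. The slide does not spell out the Judge and Filter rules, so two assumptions are made and marked in the comments: Judge stops once no node exceeds the partition capacity, and Filter keeps only signatures that fall under over-full nodes. All names are hypothetical.

```python
from collections import Counter

def node_statistics(freqs, letters_per_level, capacity):
    """Aggregate (signature, freq) pairs layer by layer until every node fits."""
    layers = []
    remaining = dict(freqs)
    depth = 1
    while remaining:
        width = depth * letters_per_level
        layer = Counter()
        for signature, freq in remaining.items():
            layer[signature[:width]] += freq      # Map to the layer prefix, Reduce by sum
        layers.append(dict(layer))
        if max(layer.values()) <= capacity:       # Judge (assumed rule): all nodes fit
            break
        # Filter (assumed rule): only signatures under over-full nodes go deeper
        remaining = {s: f for s, f in remaining.items()
                     if layer[s[:width]] > capacity and len(s) > width}
        depth += 1
    return layers

# Made-up frequencies; 8 segments means 2 hex letters per bit level.
stats = node_statistics({"0201": 60, "0202": 70, "0301": 10, "ff01": 20}, 2, 100)
print(stats)
```

With these inputs the first layer contains an over-full node "02", so only its signatures are carried into the second layer.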

Global Index [3/4]: Build Tree
Segment number: 8
Partition capacity: 100,000
Input: the (iSAX-T, Freq) statistics of each layer
1st layer (iSAX-T, Freq)
• ("01", 512)
• ("02", 355,000)
• ….
• ("ff", 270,520)
2nd layer (iSAX-T, Freq)
• ("0201", 5,012)
• ("0202", 100,550)
• ….
• ("ffff", 10,520)
3rd layer (iSAX-T, Freq)
• ("020201", 12)
• ("020202", 550)
• ….
• ("0202ff", 620)
[Figure: the resulting index tree.]
Root
  iSAX-T: 01, Freq: 512
  iSAX-T: 02, Freq: 350,000
    iSAX-T: 0201, Freq: 5,012
    iSAX-T: 0202, Freq: 100,550
      iSAX-T: 020201, Freq: 12
      iSAX-T: 020202, Freq: 550
      . . .
      iSAX-T: 0202ff, Freq: 620
    . . .
    iSAX-T: 02ff, Freq: 620
  iSAX-T: 03, Freq: 4,352
  . . .
  iSAX-T: ff, Freq: 270,520
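Nesting the per-layer statistics into a tree can be sketched as a prefix tree over signatures. The node layout (a dict with `sig`, `freq`, and `children`) and the function name are illustrative choices, not the authors' data structure; the input reuses the layer lists shown on the slide.

```python
def build_tree(layers, letters_per_level):
    """Nest per-layer (signature, freq) statistics into a prefix tree."""
    root = {"sig": "", "freq": None, "children": {}}
    for layer in layers:                          # shallow layers first, so parents exist
        for signature, freq in layer.items():
            node = root
            # Walk down through every proper prefix of the signature.
            for end in range(letters_per_level, len(signature), letters_per_level):
                node = node["children"][signature[:end]]
            node["children"][signature] = {"sig": signature, "freq": freq, "children": {}}
    return root

layers = [
    {"01": 512, "02": 355000, "03": 4352, "ff": 270520},
    {"0201": 5012, "0202": 100550, "02ff": 620, "ffff": 10520},
    {"020201": 12, "020202": 550, "0202ff": 620},
]
tree = build_tree(layers, 2)
print(sorted(tree["children"]["02"]["children"]))  # ['0201', '0202', '02ff']
```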

Global Index [4/4]: Assign Partition ID to Leaf Nodes
Bin packing problem: how to fit a set of nodes into the smallest number of partitions?
Partition capacity: 100,000
Parent node: iSAX-T: 02, Freq: 390,500
Child nodes:
iSAX-T: 0201, Freq: 70,000
iSAX-T: 0202, Freq: 50,000
iSAX-T: 0203, Freq: 20,000
iSAX-T: 0204, Freq: 40,000
iSAX-T: 0205, Freq: 80,000
iSAX-T: 0206, Freq: 130,500
Assignment shown in the figure:
Partition ID 1: 0202 (Freq: 50,000), 0204 (Freq: 40,000)
Partition ID 2: 0201 (Freq: 70,000), 0203 (Freq: 20,000)
Partition ID 3: 0205 (Freq: 80,000)
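One common heuristic for this bin packing problem is first-fit decreasing, sketched below on the node frequencies from the slide. The slide does not say which heuristic Spark-ITS uses, so this is a stand-in; it also assumes that a node larger than the capacity is split further rather than packed. The function name is hypothetical.

```python
def assign_partitions(node_freqs, capacity):
    """First-fit decreasing: place each node in the first partition with enough room."""
    loads = []        # loads[i] is the total frequency assigned to partition i + 1
    assignment = {}
    for signature, freq in sorted(node_freqs.items(), key=lambda item: -item[1]):
        if freq > capacity:
            continue  # assumption: over-full nodes are split further, not packed
        for pid, load in enumerate(loads, start=1):
            if load + freq <= capacity:
                loads[pid - 1] += freq
                assignment[signature] = pid
                break
        else:         # no existing partition had room, so open a new one
            loads.append(freq)
            assignment[signature] = len(loads)
    return assignment, loads

nodes = {"0201": 70000, "0202": 50000, "0203": 20000,
         "0204": 40000, "0205": 80000, "0206": 130500}
assignment, loads = assign_partitions(nodes, 100000)
print(assignment)
print(loads)
```

On this input the heuristic also needs three partitions, although it groups the nodes differently from the figure (for example, 0205 and 0203 share a partition).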

Repartition: Wrap Global Index as the Partitioner
[Figure: the global index tree annotated with partition IDs; each time series is converted to its iSAX-T signature and routed through the tree.]
Root
  iSAX-T: 01, Freq: 512, pid: 1
  iSAX-T: 02, Freq: 350,000, pid: 5,6,7
    iSAX-T: 0201, Freq: 5,012, pid: 5
    iSAX-T: 0202, Freq: 100,550, pid: 6,7
      iSAX-T: 020201, Freq: 12, pid: 6
      iSAX-T: 020202, Freq: 550, pid: 6
      . . .
      iSAX-T: 0202ff, Freq: 620
    . . .
    iSAX-T: 02ff, Freq: 620, pid: 5
  iSAX-T: 03, Freq: 4,352, pid: 1
  . . .
  iSAX-T: ff, Freq: 360,520, pid: 10,11,12
Example record: a time series TS: [0.34, 0.31, 1.14…] with iSAX-T: 0202ff45
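Routing a signature through the global index can be sketched as a prefix walk that returns the partition IDs of the deepest matching node. The tree literal below copies a few nodes from the slide; the dict layout and the function name `find_partition` are hypothetical, and a real Spark job would wrap this lookup in a custom partitioner.

```python
def find_partition(tree, signature, letters_per_level):
    """Descend by signature prefix; return the pid list of the deepest matching node."""
    node, pids = tree, None
    for end in range(letters_per_level, len(signature) + 1, letters_per_level):
        child = node["children"].get(signature[:end])
        if child is None:
            break
        node, pids = child, child["pids"]
    return pids

def leaf(pids):
    return {"pids": pids, "children": {}}

tree = {"pids": None, "children": {
    "01": leaf([1]),
    "02": {"pids": [5, 6, 7], "children": {
        "0201": leaf([5]),
        "0202": {"pids": [6, 7], "children": {
            "020201": leaf([6]),
            "020202": leaf([6]),
        }},
    }},
}}
print(find_partition(tree, "020201ab", 2))  # [6]
print(find_partition(tree, "0201cdef", 2))  # [5]
```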

Local Index: Construction Within Each Partition
Partition capacity: 100,000
Node split threshold: 1000
Segment number: 8
[Figure: the time series in one partition are read one by one (e.g. iSAX-T: abcd12ef34, ….) and inserted into both the Local Index and the Bloom Filter.]
Local Index
Root, Freq: 90,990
  abcd, Freq: 5000
    abcd12, Freq: 1010
      abcd12ff, Freq: 30
      abcd1201, Freq: 42
      ……
    abcd45, Freq: 96
    …
  ab3c, Freq: 450
  ab45, Freq: 4000
  ……
Leaf nodes store records of the form (iSAX-T, ts, rid).
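A Bloom filter of the kind built next to each local index can be sketched as follows. This is a generic textbook Bloom filter, not the Spark-ITS implementation: the bit-array size, the number of hash functions, the SHA-256 based hashing, and the choice of the iSAX-T signature as the key are all assumptions made for the example.

```python
import hashlib

class BloomFilter:
    """Set membership with no false negatives and a tunable false-positive rate."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit keeps the sketch simple

    def _positions(self, key):
        # Derive num_hashes positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256("{}:{}".format(seed, key).encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key):
        return all(self.bits[position] for position in self._positions(key))

bloom = BloomFilter()
bloom.add("abcd12ef34")
print(bloom.might_contain("abcd12ef34"))  # True
```

A negative answer is definitive, so a query whose key is absent can skip the local index entirely; a positive answer still has to be confirmed against the stored records.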

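Exact Matching Query
[Figure: query flow between the master and one worker.]
1. Query: the query series is converted to its signature (e.g. 000*, 001*, 001*; iSAX-T: 002)
2. Master: the Global Index maps the signature to a partition (Pid: 6)
3. Worker, Partition 6: the Bloom Filter answers "Exist?"
   No: the query ends without a match
   Yes: continue to the Local Index
4. Local Index: fetch the records in the matching leaf node and return those with euDist = 0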
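The worker-side part of the exact matching flow can be sketched as below. To stay self-contained, a Python set stands in for the Bloom filter and a dict from signature to records stands in for the local index; both, along with the function names and the sample records, are illustrative assumptions.

```python
import math

def eu_dist(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_match(query_ts, query_sig, partition):
    """Return the record ids in one partition whose series equals the query."""
    if query_sig not in partition["bloom"]:      # stand-in for the Bloom filter test
        return []
    leaf_records = partition["leaves"].get(query_sig, [])
    return [rid for ts, rid in leaf_records if eu_dist(ts, query_ts) == 0]

partition_6 = {
    "bloom": {"002"},
    "leaves": {"002": [([0.1, 0.2, 0.3], 17), ([0.1, 0.2, 0.4], 18)]},
}
print(exact_match([0.1, 0.2, 0.3], "002", partition_6))  # [17]
```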

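kNN Approximate Query: One Partition Access
[Figure: query flow between the master and one worker.]
1. Query: the query series is converted to its signature (e.g. 000*, 001*, 001*; iSAX-T: 002)
2. Master: the iSAX-T Skeleton maps the signature to a partition (Pid: 6)
3. Worker, Partition 6, Local Index, over the records in a leaf/internal node: (1) compute euDist, (2) sort, (3) use the Top(K) distance as the threshold
4. Worker, Partition 6, over the records in a leaf/internal node: (1) compute euDist, (2) sort, (3) take Top(K)
5. Result: (euDist, rid) pairs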
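The threshold-then-top-K pattern on this slide can be sketched as follows. The split into a small seed set of leaf records and a wider candidate set (which includes the seed records) is an assumption about how the two passes relate; the function names and the one-point series are made up for the example.

```python
import heapq
import math

def eu_dist(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query_ts, records, k):
    """Compute distances, sort, and keep the k closest (euDist, rid) pairs."""
    return heapq.nsmallest(k, ((eu_dist(ts, query_ts), rid) for ts, rid in records))

def knn_one_partition(query_ts, leaf_records, wider_records, k):
    """Use the k-th distance in the leaf as a pruning threshold for a wider scan."""
    seed = top_k(query_ts, leaf_records, k)
    threshold = seed[-1][0] if len(seed) == k else math.inf
    candidates = ((eu_dist(ts, query_ts), rid) for ts, rid in wider_records)
    return heapq.nsmallest(k, (pair for pair in candidates if pair[0] <= threshold))

leaf_records = [([1.0], 1), ([2.0], 2)]
wider_records = leaf_records + [([0.5], 3), ([5.0], 4)]
print(knn_one_partition([0.0], leaf_records, wider_records, 2))  # [(0.5, 3), (1.0, 1)]
```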

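kNN Approximate Query: Multi-Partitions Access
[Figure: query flow between the master and several workers.]
1. Query: the query series is converted to its signature (e.g. 000*, 001*, 001*; iSAX-T: 002)
2. Master: the iSAX-T Skeleton maps the signature to its partition (Partition 6) and to the Sibling Pid List
3. Worker, Partition 6, Local Index, over the records in a leaf/internal node: (1) compute euDist, (2) sort, (3) use the Top(K) distance as the threshold
4. Workers of the sibling partitions, Local Index, over the records in a leaf/internal node: (1) compute euDist, (2) sort, (3) take Top(K)
5. Result: (euDist, rid) pairs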
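When several partitions each return their own Top(K) list, the final answer is the K smallest pairs across all lists. A minimal sketch, with a hypothetical function name and made-up distances:

```python
import heapq

def merge_top_k(per_partition_results, k):
    """Merge per-partition (euDist, rid) lists into one global top-k list."""
    all_pairs = (pair for result in per_partition_results for pair in result)
    return heapq.nsmallest(k, all_pairs)

results = [
    [(0.5, 3), (1.0, 1)],   # from the query's own partition
    [(0.2, 9), (3.0, 8)],   # from a sibling partition
]
print(merge_top_k(results, 2))  # [(0.2, 9), (0.5, 3)]
```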

Experimental Setup
Datasets (normalized; each point is saved in float format):
Dataset        Size          Length   Source
Random Walk    1 billion     256
Texmex         1 billion     128      http://corpus-texmex.irisa.fr/
DNA            200 million   192      https://genmone.ucsc.edu
Noaa Climate   200 million   64       https://www.ncdc.gov/
HW & SW configuration:
Spark      2.0.2, Standalone mode
Hadoop     2.7.3
Platform   Ubuntu 16.04 LTS
HW         2 nodes, each consisting of 56 Xeon E5 processors, 500G RAM, 7TB SATA hard drive
Baseline (state-of-the-art): Yagoubi, Djamel-Edine, et al. "DPiSAX: Massively Distributed Partitioned iSAX." ICDM 2017.
Parameter                                  Baseline   Spark-ITS
Initial cardinality                        512        64
Word length                                8          8
Sampling percent                           10%        10%
Leaf node split threshold of local index   1000       1000
The baseline uses its default initial cardinality; it needs a large initial value to guarantee enough bit levels for binary splits.

Index Construction Time
[Figure: bar chart of index construction time in minutes (axis 0 to 2,000) versus #Time Series (200m, 400m, 600m, 800m, 1b) for Spark-ITS and the baseline (state-of-the-art), split into Global Index and Local Index time. Annotated bar values: 2,323 and 334. Annotation: 80+%.]
Dataset: Random Walk Benchmark

Index Construction Time: Breakdown
[Figure, first panel: Global Index Time Breakdown. Time in minutes (axis 0 to 40) for Spark-ITS and the baseline (state-of-the-art) at 200m, 400m, 600m, 800m, and 1b time series, split into Sampling, Statistic, Build Index, and Assign Pid.]
[Figure, second panel: Repartition and Local Index Time Breakdown. Time in minutes (axis 0 to 2,000) for the same systems and dataset sizes, split into Read and Conversion, and Shuffle and Build Local Index.]
Dataset: Random Walk Benchmark

Exact Matching Query
[Figure: exact matching query results for Spark-ITS and the state-of-the-art baseline; no values survive in the text.]

kNN-Approximate Query Performance
[Figure, first panel: Recall (axis 0% to 60%) per dataset for Spark-ITS and the state-of-the-art baseline. Annotations: ✕30, ✕35, ✕28, ✕72.]
[Figure, second panel: Error Ratio (axis 1.0 to 2.2) per dataset for the same systems. Annotations: 27%, 26%, 48%, 34%.]
Datasets (#Time series): RandomWalk (400m), Texmex (400m), DNA (200m), Noaa (200m)

Conclusion
• Index Tree
– Large fan-out decreases the depth of leaf nodes
– Better preserves similarity at the word level
– The signature simplifies cardinality conversion
• Spark-ITS: Index Construction
– Block sampling and node statistics collection build the global index quickly
– Local indices are built synchronously within each partition
– Index construction time decreases by 80+%
• Spark-ITS: Query
– Exact matching: query time decreases by 50%
– kNN approximate: accuracy increases more than 10-fold

Acknowledge Funding from...
Xianjin Tech Co., Ltd.
Saudi Arabian Cultural Mission
WPI Computer Science Dept.,
NSF CNS: 305258 II-EN
NSF CRI: 0551584
