Liang Zhang (lzhang6@wpi.edu)
Data Science Dept.,
Worcester Polytechnic Institute
Spark-ITS:
Indexing for Large-Scale Time
Series Data on Spark
#SAISEco5
Prof. Elke A. Rundensteiner
Prof. Mohamed Y. Eltabakh
Liang Zhang
Noura Alghamdi
Data Science Research Group @ Worcester Polytechnic Institute
Liang Zhang, Noura Alghamdi, Mohamed Y. Eltabakh, Elke A. Rundensteiner. TARDIS: Distributed Indexing Framework for
Big Time Series Data. Proceedings of 35th IEEE International Conference on Data Engineering ICDE, 2019
Outline
• Motivation
• Background
• Spark-ITS Framework
– Overview
– Index Construction
– Query Processing
• Performance Evaluation
Time Series are Continuously Produced Everywhere
• How to deal with billions
of time series?
[Figure: example time series sources: climate data, web log, stock price, EEG]
Almost all Time Series Data Mining Tasks rely on
Similarity Query
Tasks: Clustering, Motif Discovery, Classification, Outlier Detection
Similarity query types: Whole Matching, Subsequence Matching
Esling, Philippe, and Carlos Agon. "Time-series data mining." ACM Computing Surveys (CSUR) 45.1 (2012): 12.
Spark-ITS
• A new index tree and an effective signature that simplify
cardinality conversion and preserve similarity better
• A distributed index framework that supports large-scale
time series datasets
• Efficient algorithms for processing exact-match and kNN
approximate queries
Spark-ITS Overview
Global Index
1. Sampling
2. Node Statistic
3. Build Index Tree
4. Assign Partition ID
Partition
1. Read and convert data
2. Shuffle data
Local Index
1. Construct Local Structure
2. Construct Bloom Filter
[Figure labels: Indexed Data, Query]
Background: iSAX Representation
PAA: Piecewise Aggregate Approximation
iSAX: indexable Symbolic Aggregate approXimation
• A time series of length 16
• PAA representation with 4 segments
• SAX representation with 4 segments and cardinality 4: [11, 10, 01, 00]
• iSAX representation with 4 segments and variable cardinality: [1_2, 1_2, 01_4, 0_2] (subscript = cardinality of the symbol)
Shieh, Jin, and Eamonn Keogh. "iSAX: indexing and mining terabyte sized time series." ACM SIGKDD, 2008.
Camerra, A., Palpanas, T., Shieh, J., & Keogh, E. "iSAX 2.0: Indexing and mining one billion time series." ICDM, 2010.
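The PAA, SAX and iSAX steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the made-up input series and the breakpoint values (the usual equiprobable cuts of a standard normal distribution for cardinality 4) are assumptions.

```python
import bisect

# Breakpoints that cut a standard normal distribution into 4 equiprobable
# regions (the usual SAX choice for cardinality 4).
BREAKPOINTS_CARD4 = [-0.6745, 0.0, 0.6745]


def paa(series, segments):
    """Piecewise Aggregate Approximation: the mean of each equal-length segment."""
    size = len(series) // segments  # assumes len(series) is a multiple of segments
    return [sum(series[i * size:(i + 1) * size]) / size for i in range(segments)]


def sax(series, segments, breakpoints=BREAKPOINTS_CARD4):
    """Encode each PAA mean as a fixed-width bit string (2 bits for cardinality 4)."""
    bits = len(breakpoints).bit_length()
    return [format(bisect.bisect_right(breakpoints, mean), f"0{bits}b")
            for mean in paa(series, segments)]


def to_isax(word, bits_per_segment):
    """Lower the cardinality of individual symbols by keeping only leading bits."""
    return [symbol[:bits] for symbol, bits in zip(word, bits_per_segment)]


# A made-up series of length 16 (hypothetical data, chosen to match the slide).
series = [2.0] * 4 + [0.5] * 4 + [-0.5] * 4 + [-2.0] * 4
word = sax(series, segments=4)        # ['11', '10', '01', '00']
mixed = to_isax(word, [1, 1, 2, 1])   # ['1', '1', '01', '0']
```

Dropping trailing bits is what makes the representation indexable: a symbol at a lower cardinality is always a prefix of the same symbol at a higher one.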
Word-level Similarity
[Figure: two panels plotting time series A, B and C over time points 1 to 11 (values from -2 to 2) against the 3-bit SAX regions 000 to 111]
State-of-the-art: Character-level Similarity
A: [0_1, 0_1, 011_3, 1_1]
B: [0_1, 0_1, 010_3, 1_1]
C: [0_1, 0_1, 010_3, 1_1]
B and C are similar
Proposed: Word-level Similarity
A: [01_2, 01_2, 01_2, 10_2]
B: [00_2, 00_2, 01_2, 11_2]
C: [01_2, 01_2, 01_2, 10_2]
A and C are similar
(subscript = number of bits of the symbol)
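One way to read the word-level example: two series stay together for as many bit levels as all of their segments agree on. The sketch below counts those levels; the function name and this reading of "word-level similarity" are assumptions made for illustration.

```python
def word_level_match(a, b):
    """Count the leading bit levels on which every segment of a and b agrees."""
    level = 0
    # zip(*word) regroups the bits by level: first bit of every segment, and so on.
    for bits_a, bits_b in zip(zip(*a), zip(*b)):
        if bits_a != bits_b:
            break
        level += 1
    return level


# The 2-bit words from the word-level panel of the slide.
A = ["01", "01", "01", "10"]
B = ["00", "00", "01", "11"]
C = ["01", "01", "01", "10"]
```

Here `word_level_match(A, C)` is 2 while `word_level_match(A, B)` is 1, which agrees with the slide's statement that A and C are similar.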
New Index Tree Supports Word-level Similarity
State-of-the-art: iSAX Binary Tree (each split adds one bit to a single segment)
Root
  [0, 1, 0]   [1, 1, 1]   [0, 0, 0]   . . .
  [0, 11, 0]   [0, 10, 0]
  [0, 11, 01]   [0, 11, 00]
Proposed: iSAX-T K-ary Tree (each level adds one bit to every segment)
Root
  1st bit: [1, 1, 0]   [1, 1, 1]   [0, 0, 0]   [0, 0, 1]   . . .
  2nd bit: [01, 00, 10]   [01, 01, 11]   [00, 00, 10]   . . .
  3rd bit: [010, 000, 100]   [011, 001, 101]   . . .
Both trees consist of internal nodes and leaf nodes.
iSAX-T (Transpose) Signature
Time series: [1100, 1101, 0110, 0001]
Transpose the bit matrix, then write each row as one hex digit:
1 1 0 0        1 1 0 0  ->  C
1 1 0 1   ->   1 1 1 0  ->  E
0 1 1 0        0 0 1 0  ->  2
0 0 0 1        0 1 0 1  ->  5
iSAX-T
SAX(T,4,16) = {1100, 1101, 0110, 0001} = CE25
SAX(T,4,8) = {110, 110, 011, 000} = CE2
SAX(T,4,4) = {11, 11, 01, 00} = CE
SAX(T,4,2) = {1, 1, 0, 0} = C
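The transpose-then-hex step can be checked with a short sketch. The function name `isax_t` is hypothetical; the logic follows the worked example on the slide.

```python
def isax_t(word):
    """Transpose the bit matrix of a SAX word and hex-encode it level by level."""
    # One bit string per bit level: the i-th bit of every segment, in segment order.
    levels = ["".join(bits) for bits in zip(*word)]
    width = (len(word) + 3) // 4  # hex digits needed per level
    return "".join(format(int(level, 2), f"0{width}X") for level in levels)


signature = isax_t(["1100", "1101", "0110", "0001"])  # 'CE25'
```

Because each bit level becomes its own group of hex digits, lowering the cardinality is just a string truncation: `signature[:3]` equals `isax_t(["110", "110", "011", "000"])`, the CE2 line on the slide.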
Global Index[1/4]: Sampling
[Figure: blocks of raw time series values, from which a sample is drawn]
Pipeline: Time series -> Sampling -> Map -> Reduce (a word-counting MapReduce process)
Map output, (iSAX-T(b bits), Freq: 1):
1256ae3e, 1
0134ef45, 1
234567ae, 1
1256ae3e, 1
1256ae3e, 1
234567ae, 1
237867ae, 1
024567ae, 1
6243371e, 1
……
452167ef, 1
Reduce output, (iSAX-T(b bits), Freq(b bits)):
1256ae3e, 23
0134ef45, 2
234567ae, 20
024567ae, 4
237867ae, 10
……
452167ef, 10
Segment number: 8, so 2 hex letters represent one bit level
Initial cardinality: b bit level
Data sizes are based on 1 billion time series of length 256: 1 Terabyte on HDFS, then 100 G, 0.9 G and 0.1 G at the later stages
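The sampling and word-counting stages can be sketched in plain Python. This is a single-machine illustration of the idea, not Spark code: the function names are hypothetical, and uniform record sampling stands in for whatever sampling the system actually uses.

```python
import random
from collections import Counter


def sample_signatures(signatures, fraction, seed=0):
    """Assumption: uniform record sampling stands in for the real sampling step."""
    k = round(len(signatures) * fraction)
    return random.Random(seed).sample(signatures, k)


def map_phase(signatures):
    """Emit one (iSAX-T signature, 1) pair per sampled time series."""
    return [(sig, 1) for sig in signatures]


def reduce_phase(pairs):
    """Sum the counts per signature, as in a word count."""
    counts = Counter()
    for sig, freq in pairs:
        counts[sig] += freq
    return dict(counts)


# The map-side signatures listed on the slide.
sigs = ["1256ae3e", "0134ef45", "234567ae", "1256ae3e", "1256ae3e",
        "234567ae", "237867ae", "024567ae", "6243371e"]
freqs = reduce_phase(map_phase(sigs))
```

With these nine signatures, `freqs["1256ae3e"]` is 3 and `freqs["234567ae"]` is 2; the larger counts on the slide come from the full sample.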
Global Index[2/4]: Node Statistic
1st layer: (iSAX-T(b), Freq(b)) -> Map -> (iSAX-T(1), Freq(b)) -> Reduce -> [(iSAX-T(1), Freq(1))] -> Judge: max(Freq(1)) -> Filter
2nd layer: (iSAX-T(b), Freq(b)) -> Map -> (iSAX-T(2), Freq(b)) -> Reduce -> [(iSAX-T(2), Freq(2))] -> Judge: max(Freq(2)) -> Filter
3rd layer: (iSAX-T(b), Freq(b)) -> Map -> (iSAX-T(3), Freq(b)) -> Reduce -> [(iSAX-T(3), Freq(3))] -> Judge: max(Freq(3))
……
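The layer-by-layer statistics can be sketched as repeated prefix aggregation. The slide does not spell out the Judge and Filter steps, so the sketch assumes that Judge compares the largest node frequency with the partition capacity and that Filter keeps only the signatures under overfull nodes; the function name and parameters are hypothetical.

```python
from collections import Counter


def node_statistics(sig_freq, capacity, chars_per_level=2):
    """Return one {prefix: freq} dict per layer, top layer first."""
    layers = []
    current = dict(sig_freq)
    if not current:
        return layers
    max_levels = max(len(sig) for sig in current) // chars_per_level
    for level in range(1, max_levels + 1):
        width = level * chars_per_level
        counts = Counter()
        for sig, freq in current.items():
            counts[sig[:width]] += freq      # Map: truncate; Reduce: sum per prefix
        layers.append(dict(counts))
        if max(counts.values()) <= capacity:  # Judge (assumed rule)
            break
        # Filter (assumed rule): only overfull nodes are refined in the next layer.
        current = {sig: freq for sig, freq in current.items()
                   if counts[sig[:width]] > capacity}
    return layers


# Hypothetical sample statistics with 8 segments (2 hex letters per bit level).
stats = node_statistics(
    {"0201aa": 60, "0202bb": 70, "0202cc": 50, "03ffee": 10}, capacity=100)
```

For this toy input, `stats` has three layers: `{"02": 180, "03": 10}`, then `{"0201": 60, "0202": 120}`, then `{"0202bb": 70, "0202cc": 50}`.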
Global Index[3/4]: Build Tree
Segment number: 8
Partition Capacity: 100,000
Root
  iSAX-T: 01, Freq: 512
  iSAX-T: 02, Freq: 350,000
    iSAX-T: 0201, Freq: 5,012
    iSAX-T: 0202, Freq: 100,550
      iSAX-T: 020201, Freq: 12
      iSAX-T: 020202, Freq: 550
      . . .
      iSAX-T: 0202ff, Freq: 620
    . . .
    iSAX-T: 02ff, Freq: 620
  iSAX-T: 03, Freq: 4,352
  . . .
  iSAX-T: ff, Freq: 270,520
1st layer (iSAX-T, Freq)
• (“01”, 512)
• (“02”, 355,000)
• ….
• (“ff”, 270,520)
2nd layer (iSAX-T, Freq)
• (“0201”, 5,012)
• (“0202”, 100,550)
• ….
• (“ffff”, 10,520)
3rd layer (iSAX-T, Freq)
• (“020201”, 12)
• (“020202”, 550)
• ….
• (“0202ff”, 620)
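Turning the per-layer (iSAX-T, Freq) lists into a tree amounts to inserting each signature under its prefix. The sketch below uses nested dictionaries; the node layout and function names are assumptions, and it relies on every parent appearing in the previous layer.

```python
def build_tree(layers, chars_per_level=2):
    """Build a prefix tree from per-layer {signature: freq} dicts, top layer first."""
    root = {"sig": "", "freq": None, "children": {}}
    for layer in layers:
        for sig, freq in sorted(layer.items()):
            node = root
            # Walk down through the ancestors of sig (assumed to exist already).
            for end in range(chars_per_level, len(sig), chars_per_level):
                node = node["children"][sig[:end]]
            node["children"][sig] = {"sig": sig, "freq": freq, "children": {}}
    return root


def leaves(node):
    """Signatures of all leaf nodes below node, in insertion order."""
    if not node["children"]:
        return [node["sig"]]
    return [sig for child in node["children"].values() for sig in leaves(child)]


# The entries that are spelled out on the slide.
tree = build_tree([
    {"01": 512, "02": 355000, "ff": 270520},
    {"0201": 5012, "0202": 100550, "ffff": 10520},
    {"020201": 12, "020202": 550, "0202ff": 620},
])
```

Only the leaves returned by `leaves(tree)` need partition IDs in the next step.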
Global Index[4/4]: Assign Partition Id to Leaf Nodes
Bin Packing Problem:
How to fit a set of nodes into the smallest number of partitions?
Partition capacity: 100,000
iSAX-T: 02, Freq: 390,500
  iSAX-T: 0201, Freq: 70,000
  iSAX-T: 0202, Freq: 50,000
  iSAX-T: 0203, Freq: 20,000
  iSAX-T: 0204, Freq: 40,000
  iSAX-T: 0205, Freq: 80,000
  iSAX-T: 0206, Freq: 130,500
Partition ID: 1: 0202 (Freq: 50,000), 0204 (Freq: 40,000)
Partition ID: 2: 0201 (Freq: 70,000), 0203 (Freq: 20,000)
Partition ID: 3: 0205 (Freq: 80,000)
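The slide frames partition assignment as bin packing but does not name a heuristic. As an illustration only, the sketch below uses first-fit decreasing, a standard bin-packing heuristic; the function name is hypothetical and the grouping it produces need not match the one drawn on the slide.

```python
def first_fit_decreasing(node_freqs, capacity):
    """Assign each node to the first partition with room, largest nodes first."""
    loads = []        # loads[i] is the current size of partition i + 1
    assignment = {}   # node signature -> partition ID (1-based)
    for sig, freq in sorted(node_freqs.items(), key=lambda kv: kv[1], reverse=True):
        if freq > capacity:
            raise ValueError(f"node {sig} exceeds the partition capacity")
        for i, load in enumerate(loads):
            if load + freq <= capacity:
                loads[i] += freq
                assignment[sig] = i + 1
                break
        else:  # no existing partition has room: open a new one
            loads.append(freq)
            assignment[sig] = len(loads)
    return assignment, loads


# Sibling nodes from the slide that fit within the capacity (0206 does not).
assignment, loads = first_fit_decreasing(
    {"0201": 70000, "0202": 50000, "0203": 20000, "0204": 40000, "0205": 80000},
    capacity=100000)
```

On the slide's numbers this also ends up with three partitions, with loads of 100,000, 70,000 and 90,000.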
Repartition: Wrap Global Index as the Partitioner
Root
  iSAX-T: 01, Freq: 512, pid: 1
  iSAX-T: 02, Freq: 350,000, pid: 5,6,7
    iSAX-T: 0201, Freq: 5,012, pid: 5
    iSAX-T: 0202, Freq: 100,550, pid: 6,7
      iSAX-T: 020201, Freq: 12, pid: 6
      iSAX-T: 020202, Freq: 550, pid: 6
      . . .
      iSAX-T: 0202ff, Freq: 620
    . . .
    iSAX-T: 02ff, Freq: 620, pid: 5
  iSAX-T: 03, Freq: 4,352, pid: 1
  . . .
  iSAX-T: ff, Freq: 360,520, pid: 10,11,12
A time series (e.g. iSAX-T: 0202ff45, TS: [0.34, 0.31, 1.14…]) is routed through the tree by its iSAX-T signature.
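Used as a partitioner, the global index only has to answer one question: which partition ID does a signature fall under? The sketch below does this with a prefix lookup over the leaf nodes. The class name, the flat dictionary of leaves and the hashed fallback for signatures that match no leaf are all assumptions for illustration; in Spark the same logic would sit behind a custom partitioner.

```python
import zlib


class GlobalIndexPartitioner:
    """Routes an iSAX-T signature to the partition of the leaf it falls under."""

    def __init__(self, leaf_pids, num_partitions, chars_per_level=2):
        self.leaf_pids = leaf_pids            # leaf signature -> partition ID
        self.num_partitions = num_partitions
        self.chars_per_level = chars_per_level

    def get_partition(self, signature):
        # Walk down the tree one bit level at a time until a leaf is reached.
        step = self.chars_per_level
        for end in range(step, len(signature) + 1, step):
            pid = self.leaf_pids.get(signature[:end])
            if pid is not None:
                return pid
        # Assumption: signatures under no known leaf get a hashed partition.
        return zlib.crc32(signature.encode()) % self.num_partitions + 1


# Leaf nodes and pids as listed on the slide.
partitioner = GlobalIndexPartitioner(
    {"01": 1, "03": 1, "0201": 5, "02ff": 5, "020201": 6, "020202": 6},
    num_partitions=12)
```

For example, `partitioner.get_partition("0202024a")` descends 02, 0202, 020202 and returns 6.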
Local Index: Construction Within Each Partition
Partition capacity: 100,000
Node split threshold: 1000
Segment Number: 8
Root, Freq: 90,990
  abcd, Freq: 5000
    abcd12, Freq: 1010
      abcd1201, Freq: 42
      ……
      abcd12ff, Freq: 30
    abcd45, Freq: 96
    …
  ab3c, Freq: 450
  ab45, Freq: 4000
  ……
Leaf nodes store records of the form (iSAX-T, ts, rid).
Inserting a record (iSAX-T: abcd12ef34, ….) is marked with +1 at three nodes.
[Figure: the time series in one partition feed both the Local Index and the Bloom Filter, keyed by iSAX-T signature (e.g. iSAX-T: abcd12ef34, ….)]
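The Bloom filter built per partition can be illustrated with a small self-contained version. The sizes, the number of hash functions and the use of SHA-256 are arbitrary choices for this sketch, not details from the slides.

```python
import hashlib


class BloomFilter:
    """A fixed-size Bloom filter over strings (e.g. iSAX-T signatures)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one cryptographic hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bloom = BloomFilter()
bloom.add("abcd12ef34")
```

A negative answer lets a query skip the partition's local index entirely, while a positive answer still has to be confirmed against the stored records.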
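Exact Matching Query
Query: [000*, 001*, 001*], iSAX-T: 002
Master: the Global Index maps the query to a partition ID (Pid: 6)
Worker (Partition 6): Bloom Filter, Exist?
  No: the query is not in the partition
  Yes: search the Local Index, then scan the records in the leaf node for euDist = 0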
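The exact-match flow can be sketched end to end with plain dictionaries. Everything here is a simplification for illustration: a Python set stands in for the Bloom filter, a dictionary keyed by full signature stands in for the local index, and the names are hypothetical.

```python
import math


def exact_match(query_sig, query_ts, leaf_pids, partitions, chars_per_level=2):
    """Return the record IDs whose time series equal query_ts (euDist = 0)."""
    # Master side: find the partition ID through a prefix lookup in the global index.
    pid = next((leaf_pids[query_sig[:end]]
                for end in range(chars_per_level, len(query_sig) + 1, chars_per_level)
                if query_sig[:end] in leaf_pids), None)
    if pid is None:
        return []
    partition = partitions[pid]
    # Worker side: a negative membership test ends the query early.
    if query_sig not in partition["bloom"]:
        return []
    candidates = partition["leaf_records"].get(query_sig, [])
    return [rid for ts, rid in candidates if math.dist(ts, query_ts) == 0]


# Hypothetical toy data: one partition holding two records under one signature.
toy_partitions = {6: {"bloom": {"020202aa"},
                      "leaf_records": {"020202aa": [([0.1, 0.2], 7), ([0.1, 0.3], 8)]}}}
hits = exact_match("020202aa", [0.1, 0.2], {"020202": 6}, toy_partitions)
```

Only one partition is touched per query, and the distance computation runs only on the records of a single leaf.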
KNN Approximate Query: One Partition Access
Query: [000*, 001*, 001*], iSAX-T: 002
Master: the iSAX-T Skeleton maps the query to a partition ID (Pid: 6)
Worker (Partition 6), Local Index:
  Records in leaf / internal node: 1. euDist, 2. sort, 3. Top(K) dist as threshold
  Records in leaf / internal node: 1. euDist, 2. sort, 3. take Top(K)
Result: (euDist, rid)
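The three steps listed for the worker (euDist, sort, take Top(K)) can be written directly. The function name and the toy records are hypothetical.

```python
import math


def knn_in_node(query_ts, records, k):
    """Return the k nearest (euDist, rid) pairs among the records of one node."""
    scored = [(math.dist(query_ts, ts), rid) for ts, rid in records]  # 1. euDist
    scored.sort()                                                     # 2. sort
    return scored[:k]                                                 # 3. take Top(K)


# Hypothetical records of the form (ts, rid).
node_records = [([0.0, 0.0], 1), ([3.0, 4.0], 2), ([1.0, 0.0], 3)]
top = knn_in_node([0.0, 0.0], node_records, k=2)
threshold = top[-1][0]  # the k-th distance, usable as a pruning threshold
```

With these records, `top` is `[(0.0, 1), (1.0, 3)]` and the threshold is 1.0.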
KNN Approximate Query: Multi-Partitions Access
Query: [000*, 001*, 001*], iSAX-T: 002
Master: the iSAX-T Skeleton supplies the Sibling Pid List
Worker (Partition 6), Local Index:
  Records in leaf / internal node: 1. euDist, 2. sort, 3. Top(K) dist as threshold
Workers (sibling partitions), Local Index:
  Records in leaf / internal node: 1. euDist, 2. sort, 3. take Top(K)
Result: (euDist, rid)
Experimental Setup
Dataset            Size          Length
Random Walk        1 billion     256
Texmex [1]         1 billion     128
DNA [2]            200 million   192
Noaa Climate [3]   200 million   64
The dataset is normalized.
Each point is stored as a float.
Source:
1. http://corpus-texmex.irisa.fr/
2. https://genome.ucsc.edu
3. https://www.ncdc.gov/
HW&SW      Configuration
Spark      2.0.2, Standalone mode
Hadoop     2.7.3
Platform   Ubuntu 16.04 LTS
HW         2 nodes, each node consists of 56 Xeon E5 processors, 500G RAM, 7TB SATA hard drive
State-of-the-Art: Yagoubi, Djamel-Edine, et al. "DPiSAX: Massively Distributed Partitioned iSAX." ICDM 2017
The baseline system uses its default initial cardinality; it needs a large initial value to guarantee enough bit levels for binary splits.
                                           Baseline   Spark-ITS
Initial cardinality                        512        64
Word length                                8          8
Sampling percent                           10%        10%
Leaf node split threshold of Local index   1000       1000
Index Construction Time
[Figure: index construction time (Minutes) versus #Time Series (200m, 400m, 600m, 800m, 1b) for Spark-ITS and the Baseline (State-of-the-Art), split into Global Index and Local Index. Dataset: Random Walk Benchmark. Annotated values: 2,323 and 334; annotated reduction: 80+%]
Index Construction Time: Breakdown
[Figure, first panel: Global Index Time Breakdown (Minutes) into Sampling, Statistic, Build Index and Assign Pid, for Spark-ITS and the Baseline (State-of-the-Art) at 200m, 400m, 600m, 800m and 1b time series]
[Figure, second panel: Repartition and Local Index Time Breakdown (Minutes) into Read and conversion and Shuffle and Build Local Index, for the same systems and dataset sizes]
Dataset: Random Walk Benchmark
Exact Matching Query
[Figure: exact matching query results compared with the State-of-the-Art]
kNN-Approximate Query Performance
[Figure, first panel: Recall (0% to 60%) per dataset compared with the State-of-the-Art, with annotations ✕30, ✕35, ✕28 and ✕72]
[Figure, second panel: Error Ratio (1.0 to 2.2) per dataset compared with the State-of-the-Art, with annotations 27%, 26%, 48% and 34%]
Datasets (#Time series): RandomWalk (400m), Texmex (400m), DNA (200m), Noaa (200m)
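The slide reports recall and error ratio without defining them. The sketch below uses definitions that are common for approximate kNN evaluation and should be read as an assumption: recall is the fraction of the exact k nearest neighbours that the approximate answer contains, and error ratio is the mean ratio between the i-th approximate and the i-th exact distance.

```python
def recall(approx_ids, exact_ids):
    """Fraction of the exact nearest neighbours found by the approximate query."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


def error_ratio(approx_dists, exact_dists):
    """Mean ratio of approximate to exact distances, rank by rank (1.0 is ideal)."""
    pairs = zip(sorted(approx_dists), sorted(exact_dists))
    ratios = [a / e for a, e in pairs if e > 0]  # skip exact distances of zero
    return sum(ratios) / len(ratios) if ratios else 1.0


# Hypothetical query results.
r = recall([1, 2, 3, 4], [1, 2, 5, 6])          # 0.5
e = error_ratio([2.0, 3.0], [1.0, 3.0])         # 1.5
```

Under these definitions a higher recall and an error ratio closer to 1.0 both indicate a more accurate approximate answer.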
Conclusion
• Index Tree
– Large fan-out decreases the depth of leaf nodes
– Preserves similarity better at the word level
– The signature simplifies cardinality conversion
• Spark-ITS: Index Construction
– Block sampling and node statistic collection build the global index quickly
– Local indices are built synchronously within each partition
– Index construction time decreases by 80+%
• Spark-ITS: Query
– Exact matching: query time decreases by 50%
– kNN approximate: accuracy increases more than tenfold
Acknowledge Funding from...
Xianjin Tech Co., Ltd.
Saudi Arabian Cultural Mission
WPI Computer Science Dept.,
NSF CNS: 305258 II-EN
NSF CRI: 0551584

More Related Content

PDF
Como se utiliza la tabla t de student (formulas)
PDF
The Future of Data Visualization on the Web. FrontEnd Con 2019.
PDF
The Future of Data Visualization on the Web (YGLF)
DOCX
Z distribution
DOC
Bouncing ball lab
PDF
Tabel x2
PDF
Tabla t student
PDF
Pattern Mining in large time series databases
Como se utiliza la tabla t de student (formulas)
The Future of Data Visualization on the Web. FrontEnd Con 2019.
The Future of Data Visualization on the Web (YGLF)
Z distribution
Bouncing ball lab
Tabel x2
Tabla t student
Pattern Mining in large time series databases

Similar to Spark-ITS: Indexing for Large-Scale Time Series Data on Spark with Liang Zhang (20)

PDF
SLIDING WINDOW SUM ALGORITHMS FOR DEEP NEURAL NETWORKS
PDF
Image similarity using symbolic representation and its variations
PPTX
Estado del Arte de la IA
KEY
Numpy Talk at SIAM
PPTX
TSIndexingIndexacao De Série ttemporal.pptx
PDF
Orthogonal Range Searching
PPTX
Time series data mining techniques
PDF
International Journal of Soft Computing, Mathematics and Control (IJSCMC)
PDF
Indexing and Mining a Billion Time series using iSAX 2.0
PDF
PDF
Titan X Research Paper
DOCX
Digital Signal Processing and Control System under MATLAB Environment
PPTX
Searching.pptx
PDF
Pandas
PPT
Secure information aggregation in sensor networks
PDF
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
PDF
Gwt sdm public
PDF
Instance-based learning (aka Case-based or Memory-based or non-parametric)
PPTX
Set Similarity Search using a Distributed Prefix Tree Index
PDF
MATLAB-Cheat-Sheet-for-Data-Science_LondonSchoolofEconomics (1).pdf
SLIDING WINDOW SUM ALGORITHMS FOR DEEP NEURAL NETWORKS
Image similarity using symbolic representation and its variations
Estado del Arte de la IA
Numpy Talk at SIAM
TSIndexingIndexacao De Série ttemporal.pptx
Orthogonal Range Searching
Time series data mining techniques
International Journal of Soft Computing, Mathematics and Control (IJSCMC)
Indexing and Mining a Billion Time series using iSAX 2.0
Titan X Research Paper
Digital Signal Processing and Control System under MATLAB Environment
Searching.pptx
Pandas
Secure information aggregation in sensor networks
SVD BASED LATENT SEMANTIC INDEXING WITH USE OF THE GPU COMPUTATIONS
Gwt sdm public
Instance-based learning (aka Case-based or Memory-based or non-parametric)
Set Similarity Search using a Distributed Prefix Tree Index
MATLAB-Cheat-Sheet-for-Data-Science_LondonSchoolofEconomics (1).pdf
Ad

More from Databricks (20)

PPTX
DW Migration Webinar-March 2022.pptx
PPTX
Data Lakehouse Symposium | Day 1 | Part 1
PPT
Data Lakehouse Symposium | Day 1 | Part 2
PPTX
Data Lakehouse Symposium | Day 2
PPTX
Data Lakehouse Symposium | Day 4
PDF
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
PDF
Democratizing Data Quality Through a Centralized Platform
PDF
Learn to Use Databricks for Data Science
PDF
Why APM Is Not the Same As ML Monitoring
PDF
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
PDF
Stage Level Scheduling Improving Big Data and AI Integration
PDF
Simplify Data Conversion from Spark to TensorFlow and PyTorch
PDF
Scaling your Data Pipelines with Apache Spark on Kubernetes
PDF
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
PDF
Sawtooth Windows for Feature Aggregations
PDF
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
PDF
Re-imagine Data Monitoring with whylogs and Spark
PDF
Raven: End-to-end Optimization of ML Prediction Queries
PDF
Processing Large Datasets for ADAS Applications using Apache Spark
PDF
Massive Data Processing in Adobe Using Delta Lake
DW Migration Webinar-March 2022.pptx
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 4
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Democratizing Data Quality Through a Centralized Platform
Learn to Use Databricks for Data Science
Why APM Is Not the Same As ML Monitoring
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Stage Level Scheduling Improving Big Data and AI Integration
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Sawtooth Windows for Feature Aggregations
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Re-imagine Data Monitoring with whylogs and Spark
Raven: End-to-end Optimization of ML Prediction Queries
Processing Large Datasets for ADAS Applications using Apache Spark
Massive Data Processing in Adobe Using Delta Lake
Ad

Recently uploaded (20)

PDF
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
PDF
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
PPTX
advance b rammar.pptxfdgdfgdfsgdfgsdgfdfgdfgsdfgdfgdfg
PPTX
1_Introduction to advance data techniques.pptx
PPTX
Computer network topology notes for revision
PPTX
Supervised vs unsupervised machine learning algorithms
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PDF
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
PPT
Reliability_Chapter_ presentation 1221.5784
PDF
Fluorescence-microscope_Botany_detailed content
PDF
Galatica Smart Energy Infrastructure Startup Pitch Deck
PPTX
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
PDF
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
PPTX
IBA_Chapter_11_Slides_Final_Accessible.pptx
PPTX
Business Ppt On Nestle.pptx huunnnhhgfvu
PPTX
Introduction-to-Cloud-ComputingFinal.pptx
PPT
Quality review (1)_presentation of this 21
PPT
ISS -ESG Data flows What is ESG and HowHow
PPTX
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
PPTX
Introduction to machine learning and Linear Models
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
advance b rammar.pptxfdgdfgdfsgdfgsdgfdfgdfgsdfgdfgdfg
1_Introduction to advance data techniques.pptx
Computer network topology notes for revision
Supervised vs unsupervised machine learning algorithms
Data_Analytics_and_PowerBI_Presentation.pptx
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
Reliability_Chapter_ presentation 1221.5784
Fluorescence-microscope_Botany_detailed content
Galatica Smart Energy Infrastructure Startup Pitch Deck
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
IBA_Chapter_11_Slides_Final_Accessible.pptx
Business Ppt On Nestle.pptx huunnnhhgfvu
Introduction-to-Cloud-ComputingFinal.pptx
Quality review (1)_presentation of this 21
ISS -ESG Data flows What is ESG and HowHow
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
Introduction to machine learning and Linear Models

Spark-ITS: Indexing for Large-Scale Time Series Data on Spark with Liang Zhang

  • 1. Liang Zhang (lzhang6@wpi.edu) Data Science Dept., Worcester Polytechnic Institute Spark-ITS: Indexing for Large-Scale Time Series Data on Spark #SAISEco5
  • 2. Prof. Elke A. Rundensteiner Prof. Mohamed Y. Eltabakh Liang Zhang Noura Alghamdi Data Science Research Group @ Worcester Polytechnic Institute Liang Zhang, Noura Alghamdi, Mohamed Y. Eltabakh, Elke A. Rundensteiner. TARDIS: Distributed Indexing Framework for Big Time Series Data. Proceedings of 35th IEEE International Conference on Data Engineering ICDE, 2019
  • 3. Outline • Motivation • Background • Spark-ITS Framework – Overview – Index Construction – Query Processing • Performance Evaluation 3
  • 4. Time Series are Continuously Produced Everywhere • How to deal with billions of time series? 4 Climate data Web log Stock priceEEG
  • 5. Almost all Time Series Data Mining Tasks rely on Similarity Query 5 Esling, Philippe, and Carlos Agon. "Time-series data mining." ACM (CSUR) 45.1 (2012): 12. Clustering Motif Discovery Classification çΩ Outlier Detection Whole Matching Subsequence Matching
  • 6. Spark-ITS • A new Index Tree and an effective Signature to simplify the cardinality conversion and keep better similarity • A Distributed Index Framework to support large-scale time series dataset • Efficient algorithms for Exact Match and kNN Approximate queries process 6
  • 7. Spark-ITS Overview 7 Global Index Indexed Data Local Index Query Partition 1. Sampling 2. Node Statistic 3. Build Index Tree 4. Assign Partition ID 1. Construct Local Structure 2. Construct Bloom Filter 1. Read and convert data 2. Shuffle data
  • 8. Background: iSAX Representation 8 Shieh, Jin, and Eamonn Keogh. "iSAX: indexing and mining terabyte sized time series." SIGKDD ACM, 2008. Camerra, A., Palpanas, T., Shieh, J., & Keogh, E. "iSAX 2.0: Indexing and mining one billion time series." ICDM, 2010 A time series of length 16 PAA representation with 4 segments PAA: Piecewise Aggregate Approximation iSAX: indexable Symbolic Aggregate approXimation SAX representation with 4 segments and cardinality 4 [11,10,01,00] iSAX representation with 4 segments and variable cardinality [1", 1", 𝟎𝟏 𝟒, 0"]
  • 9. Word-level Similarity 9 -2 -1 0 1 2 1 3 5 7 9 11 A B C 111 110 101 100 011 010 001 000 -2 -1 0 1 2 1 3 5 7 9 11 A B C 111 110 101 100 011 010 001 000 State-of-the-art: Character-level Similarity Proposed: Word-level Similarity B and C are similar A and C are similar A: [𝟎 𝟏, 𝟎 𝟏, 𝟎𝟏𝟏 𝟑, 𝟏 𝟏] B: [𝟎 𝟏, 𝟎 𝟏, 𝟎𝟏𝟎 𝟑, 𝟏 𝟏] C: [𝟎 𝟏, 𝟎 𝟏, 𝟎𝟏𝟎 𝟑, 𝟏 𝟏] A: [𝟎𝟏 𝟐, 𝟎𝟏 𝟐, 𝟎𝟏 𝟐, 𝟏𝟎 𝟐] B: [𝟎𝟎 𝟐, 𝟎𝟎 𝟐, 𝟎𝟏 𝟐, 𝟏𝟏 𝟐] C: [𝟎𝟏 𝟐, 𝟎𝟏 𝟐, 𝟎𝟏 𝟐, 𝟏𝟎 𝟐]
  • 10. New Index Tree Supports Word-level Similarity 10 Proposed: iSAX-T K-ary TreeState-of-the-art: iSAX Binary Tree Root 0", 1",0" 1", 1", 1"0", 0", 0" 0", 11%,0" 0", 10%, 0" 0",11%, 01% 0", 11%,00% . . .. . . Leaf nodeInternal node 1st bit 2nd bit 3rd bit Root 1",1",0" 1",1",1"0",0",0" 01$, 00$, 10$ 01$, 01$, 11$ 010&,000&,100& . . . 0",0",1" . . . 00$, 00$, 10$ 011&, 001&, 101& . . . Leaf nodeInternal node
  • 11. iSAX-T(Transpose) Signature 11 iSAX-T SAX(T,4,16) = {1100, 1101, 0110, 0001} = CE25 SAX(T,4,8) = {110, 110, 011, 000 } = CE2 SAX(T,4,4) = {11, 11, 01, 00 } = CE SAX(T,4,2) = {1, 1, 0, 0 } = C HexTranspose 1 1 0 0 1 1 0 1 0 1 1 0 0 0 0 1 1 1 0 0 1 1 1 0 0 0 1 0 0 1 0 1 C E 2 5 Time series: [1100, 1101, 0110, 0001]
  • 12. Outline • Motivation • Background • Spark-ITS Framework – Overview – Index Construction – Query Processing • Performance Evaluation 12
  • 13. 1.368099 2.713573 -4.851872 -2.710113 -5.577432 0.797747 -0.998534 0.535733 -2.244053 -0.298195 -5.040225 -0.093288 2.683385 -4.839688 5.617443 -0.087439 -0.857566 2.537812 -3.809641 0.638194 0.706312 -3.016157 -3.094813 4.975719 -3.664357 1.402586 0.444090 1.969943 1.282233 1.912557 2.277926 1.511366 0.945206 5.769843 0.406734 -4.205288 0.850925 -2.994073 1.270280 1.286681 -5.681450 3.137617 -4.996282 3.160174 -8.749059 2.648822 6.117611 3.109095 3.159747 0.442472 -1.482878 3.432288 0.960204 0.380183 -3.925308 -2.112708 -2.991460 -3.692369 4.508871 6.430551 -3.929611 -4.271633 0.268938 -1.756457 0.978831 0.783966 -1.982449 1.100825 -6.741050 0.882729 -3.098735 -0.330746 -0.659716 1.345305 -1.537599 -0.639539 2.028107 3.638267 0.211225 5.067515 -0.479032 2.713979 -2.921332 1.231413 -1.559693 -1.057173 -0.335133 -3.601023 -4.891684 -1.832524 -2.828772 0.257098 2.288298 -4.795566 -0.054114 1.991941 Global Index[1/4]: Sampling 1.368099 2.713573 -4.851872 -2.710113 -5.577432 0.797747 -0.998534 0.535733 -2.244053 -0.298195 -5.040225 -0.093288 2.683385 -4.839688 5.617443 -0.087439 -0.857566 2.537812 -3.809641 0.638194 0.706312 -3.016157 -3.094813 4.975719 0.594820 1.136821 0.163368 5.379237 -5.453637 -0.282540 0.572556 6.158454 -1.632961 -1.560935 2.514265 2.987787 -3.006184 0.965107 4.543610 0.614290 1.851868 -2.935539 -0.716928 2.357205 -3.126861 1.620514 -0.490122 -3.380533 -2.301087 -4.727099 6.885664 -5.210190 1.707254 -7.965270 -0.914942 0.622116 1.620520 -0.994487 -0.021151 -1.749576 -3.664357 1.402586 0.444090 1.969943 1.282233 1.912557 2.277926 1.511366 0.945206 5.769843 0.406734 -4.205288 0.850925 -2.994073 1.270280 1.286681 -5.681450 3.137617 -4.996282 3.160174 -8.749059 2.648822 6.117611 3.109095 -3.164684 0.884269 2.925519 -1.051656 -0.371788 -1.661374 0.041967 0.126226 -5.662528 -1.026395 -1.317764 2.268905 -2.998881 2.628193 -4.195228 2.261641 -1.676540 -1.646810 -0.056534 -1.551837 -5.098098 2.857196 -2.981121 -0.559482 -0.573813 2.996416 
-2.567590 3.113241 2.385687 -1.195035 -4.255606 2.898200 -2.443996 1.196084 -0.759899 -2.065437 -0.709810 -2.848963 4.183200 -0.901386 3.303330 -2.852400 0.022771 1.184107 1.556879 1.825932 -2.375073 1.430655 -1.844884 0.074289 3.309890 1.141529 2.022026 0.552751 3.941867 -4.384278 -0.374243 1.231169 3.094143 1.208599 -1.893548 4.995226 -4.282130 -1.408826 4.439037 1.134620 3.159747 0.442472 -1.482878 3.432288 0.960204 0.380183 -3.925308 -2.112708 -2.991460 -3.692369 4.508871 6.430551 -3.929611 -4.271633 0.268938 -1.756457 0.978831 0.783966 -1.982449 1.100825 -6.741050 0.882729 -3.098735 -0.330746 1.041127 -3.140574 -1.436662 2.035271 0.203884 -0.821091 -0.659716 1.345305 -1.537599 -0.639539 2.028107 3.638267 0.211225 5.067515 -0.479032 2.713979 -2.921332 1.231413 -1.559693 -1.057173 -0.335133 -3.601023 -4.891684 -1.832524 -2.828772 0.257098 2.288298 -4.795566 -0.054114 1.991941 -3.091072 4.949271 -0.935447 3.327516 -3.299987 4.897994 -2.998881 2.628193 -4.195228 2.261641 -1.676540 -1.646810 -0.056534 -1.551837 -5.098098 2.857196 -2.981121 -0.559482 -0.573813 2.996416 -2.567590 3.113241 2.385687 -1.195035 -4.255606 2.898200 -2.443996 1.196084 -0.759899 -2.065437 -0.709810 -2.848963 4.183200 -0.901386 3.303330 -2.852400 0.022771 1.184107 1.556879 1.825932 -2.375073 1.430655 -1.844884 0.074289 3.309890 1.141529 2.022026 0.552751 3.941867 -4.384278 -0.374243 1.231169 3.094143 1.208599 -1.893548 4.995226 -4.282130 -1.408826 4.439037 1.134620 3.159747 0.442472 -1.482878 3.432288 0.960204 0.380183 13 Sampling Map Reduce iSAX-T(b bits), Freq:1Time series iSAX-T(b bits), Freq(b bits) 1256ae3e , 1 0134ef45 , 1 234567ae , 1 1256ae3e , 1 1256ae3e , 1 234567ae , 1 237867ae , 1 024567ae , 1 6243371e , 1 …… 452167ef , 1 1256ae3e , 23 0134ef45 , 2 234567ae , 20 024567ae , 4 237867ae , 10 …… 452167ef , 10 Segment Number: 8, so use 2 letters to represent 1 bit Initial cardinality: b bit level The data size is based on 1 billion time series with 256 length Word counting MapReduce 
process 1 Terabyte 100 G 0.9 G HDFS 0.1 G
  • 14. Global Index[2/4]: Node Statistic 14 (iSAX-T(b),Freq(b)) (iSAX-T(1),Freq(b)) [(iSAX-T(1),Freq(1))] max(Freq(1)) Map Reduce Judge 1st layer: 2nd layer: Filter (iSAX-T(b),Freq(b)) (iSAX-T(2),Freq(b)) [(iSAX-T(2),Freq(2))] max(Freq(2)) Map Reduce Judge Filter 3rd layer: (iSAX-T(b),Freq(b)) (iSAX-T(3),Freq(b)) [(iSAX-T(3),Freq(3))] max(Freq(3)) Map Reduce Judge ……
  • 15. Global Index[3/4]: Build Tree 15 Root iSAX-T: 01 Freq: 512 . . . Segment number: 8 Partition Capacity: 100,000 iSAX-T: 02 Freq: 350,000 iSAX-T: 03 Freq: 4,352 iSAX-T: ff Freq: 270,520 iSAX-T: 0201 Freq: 5,012 iSAX-T: 0202 Freq: 100,550 iSAX-T: 02ff Freq: 620. . . iSAX-T: 020201 Freq: 12 iSAX-T: 020202 Freq: 550 iSAX-T: 0202ff Freq: 620 . . . • (“01”, 512) • (“02”, 355,000) • …. • (“ff”, 270,520) • (“0201”, 5,012) • (“0202”, 100,550) • …. • (“ffff”, 10,520) • (“020201”, 12) • (“020202”, 550) • …. • (“0202ff”, 620) 1st layer (iSAX-T, Freq) 2nd layer (iSAX-T, Freq) 3rd layer (iSAX-T, Freq)
  • 16. Global Index[4/4]: Assign Partition Id to Leaf Nodes 16 Bin Packing Problem: How to fit a set of nodes in the smallest numbers of partitions? Partition capacity: 100,000 Partition ID: 1 Partition ID: 2 Partition ID: 3 iSAX-T: 02 Freq: 390,500 iSAX-T: 0201 Freq: 70,000 iSAX-T: 0202 Freq: 50,000 iSAX-T: 0203 Freq: 20,000 iSAX-T: 0204 Freq: 40,000 iSAX-T: 0205 Freq: 80,000 iSAX-T: 0206 Freq: 130,500 iSAX-T: 0202 Freq: 50,000 iSAX-T: 0204 Freq: 40,000 iSAX-T: 0201 Freq: 70,000 iSAX-T: 0203 Freq: 20,000 iSAX-T: 0205 Freq: 80,000
  • 17. Repartition: Wrap Global Index as the Partitioner. Tree with partition IDs: Root → iSAX-T: 01 (Freq: 512, pid: 1), iSAX-T: 02 (Freq: 350,000, pid: 5,6,7), iSAX-T: 03 (Freq: 4,352, pid: 1), …, iSAX-T: ff (Freq: 360,520, pid: 10,11,12); under 02 → iSAX-T: 0201 (Freq: 5,012, pid: 5), iSAX-T: 0202 (Freq: 100,550, pid: 6,7), …, iSAX-T: 02ff (Freq: 620, pid: 5); under 0202 → iSAX-T: 020201 (Freq: 12, pid: 6), iSAX-T: 020202 (Freq: 550, pid: 6), …, iSAX-T: 0202ff (Freq: 620). Example: a time series TS: [0.34, 0.31, 1.14…] with iSAX-T: 0202ff45 is routed through the tree by its signature.
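Using the global index as a partitioner amounts to walking a signature down the tree until a leaf with a partition ID is reached. A minimal sketch, assuming the leaves are stored in a flat dictionary keyed by signature prefix and that each layer adds 2 hex letters; `lookup_pid` is a hypothetical name. The leaf-to-pid entries are the ones readable on the slide; 0202ff has no pid shown there, so it is omitted.

```python
def lookup_pid(leaf_pids, signature, hex_per_level=2):
    """Walk down layer by layer; return the pid of the first leaf prefix hit."""
    for end in range(hex_per_level, len(signature) + 1, hex_per_level):
        pid = leaf_pids.get(signature[:end])
        if pid is not None:
            return pid
    return None  # no indexed leaf covers this signature

# Leaf nodes and partition IDs as shown on the slide.
leaf_pids = {"01": 1, "03": 1, "0201": 5, "02ff": 5, "020201": 6, "020202": 6}

pid_a = lookup_pid(leaf_pids, "02020145")   # falls under leaf 020201
pid_b = lookup_pid(leaf_pids, "0201ab12")   # falls under leaf 0201
pid_c = lookup_pid(leaf_pids, "0202ff45")   # leaf pid not shown on the slide
```

A real partitioner would also need a fallback for signatures that no sampled leaf covers; that policy is not described in the slides.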
  • 18. Local Index: Construction Within Each Partition. Partition capacity: 100,000. Node split threshold: 1000. Segment number: 8. The time series in one partition, stored as (iSAX-T, ts, rid) records, feed both the Local Index and the Bloom Filter. Local Index: Root (Freq: 90,990) → abcd (Freq: 5000), ab3c (Freq: 450), ab45 (Freq: 4000), ……; under abcd → abcd12 (Freq: 1010), abcd45 (Freq: 96), …; under abcd12 → abcd12ff (Freq: 30), abcd1201 (Freq: 42), ……; leaf nodes hold (iSAX-T, ts, rid) records. Bloom Filter: each signature (e.g. iSAX-T: abcd12ef34, ….) is added to the filter (+1 +1 +1).
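The local index can be pictured as a prefix tree whose leaves split once they hold more records than the split threshold. The sketch below is an illustration under assumptions: all signatures have the same length, each level consumes 2 hex letters, and a leaf at full signature depth is never split. `Node` and `insert` are hypothetical names, and the tiny threshold in the demo is only there to force splits.

```python
class Node:
    def __init__(self, sig):
        self.sig = sig          # iSAX-T prefix covered by this node
        self.freq = 0           # number of records at or below this node
        self.records = []       # (signature, ts, rid) tuples held by a leaf
        self.children = None    # dict prefix -> Node once the node is split

def insert(node, record, threshold=1000, hex_per_level=2):
    """Insert (signature, ts, rid), splitting a leaf that exceeds threshold."""
    sig = record[0]
    node.freq += 1
    if node.children is not None:                 # internal node: descend
        key = sig[:len(node.sig) + hex_per_level]
        child = node.children.setdefault(key, Node(key))
        insert(child, record, threshold, hex_per_level)
        return
    node.records.append(record)
    can_split = len(node.sig) + hex_per_level <= len(sig)
    if len(node.records) > threshold and can_split:
        pending, node.records, node.children = node.records, [], {}
        for rec in pending:                       # redistribute to children
            key = rec[0][:len(node.sig) + hex_per_level]
            child = node.children.setdefault(key, Node(key))
            insert(child, rec, threshold, hex_per_level)

root = Node("")
for rid, sig in enumerate(["abcd12", "abcd45", "ab3c00", "abcd12", "abcd12"]):
    insert(root, (sig, [0.0], rid), threshold=2)
```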
  • 20. Exact Matching Query. Master: Query (iSAX-T: 002; 000*,001*,001*) → Global Index → Pid: 6. Worker (Partition 6): Bloom Filter: Exist? No / Yes; if Yes → Local Index → records in leaf node → euDist = 0.
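The worker-side flow (Bloom filter check, then a scan of the leaf for records at distance 0) can be sketched as below. This is a simplified stand-in, not the Spark-ITS implementation: the Bloom filter is a small hand-rolled one keyed on the signature, its size and hash count are arbitrary, and `BloomFilter` and `exact_match` are hypothetical names. A distance of 0 is checked as element-wise equality of the series.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

def exact_match(bloom, leaf_records, query_sig, query_ts):
    """Return rids whose series equal the query exactly (euDist = 0)."""
    if not bloom.might_contain(query_sig):
        return []                      # definitely absent: skip the local index
    return [rid for sig, ts, rid in leaf_records
            if sig == query_sig and ts == query_ts]

records = [("abcd12ef", [0.1, 0.2], 7), ("abcd12ef", [0.1, 0.3], 8)]
bloom = BloomFilter()
for sig, _ts, _rid in records:
    bloom.add(sig)
hits = exact_match(bloom, records, "abcd12ef", [0.1, 0.2])
```

A Bloom filter can return false positives but never false negatives, so a "No" answer safely ends the query without touching the local index.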
  • 21. kNN Approximate Query: One Partition Access. Master: Query (iSAX-T: 002; 000*,001*,001*) → iSAX-T Skeleton → Pid: 6. Worker (Partition 6): Local Index → records in leaf / internal node → 1. euDist, 2. sort, 3. take Top(K) → (euDist, rid). Top(K) dist as threshold.
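The three worker steps (compute euDist, sort, take the top K) map directly onto a small function. A minimal sketch, assuming the candidate records from the chosen leaf or internal node are already in hand as (signature, ts, rid) tuples; `eu_dist` and `knn_in_node` are hypothetical names.

```python
import heapq
import math

def eu_dist(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_in_node(records, query_ts, k):
    """Return the k nearest (euDist, rid) pairs among a node's records."""
    scored = ((eu_dist(ts, query_ts), rid) for _sig, ts, rid in records)
    return heapq.nsmallest(k, scored)   # sorted ascending by distance

records = [("ab", [0.0, 0.0], 1), ("ab", [3.0, 4.0], 2), ("ab", [1.0, 0.0], 3)]
top2 = knn_in_node(records, [0.0, 0.0], k=2)
```

`heapq.nsmallest` combines the sort and take steps and avoids sorting the full candidate list when K is small.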
  • 22. kNN Approximate Query: Multi-Partitions Access. Master: Query (iSAX-T: 002; 000*,001*,001*) → iSAX-T Skeleton → Sibling Pid List. Worker (Partition 6): Local Index → records in leaf / internal node → 1. euDist, 2. sort, 3. Top(K) dist as threshold. Workers (sibling partitions): Local Index → records in leaf / internal node → 1. euDist, 2. sort, 3. take Top(K) → (euDist, rid).
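One way to read the multi-partition variant: the home partition yields a first top-K, its K-th distance becomes a threshold, and sibling partitions contribute only records within that threshold before a final merge. The sketch below follows that reading and is an assumption rather than the published algorithm (which may prune with index bounds instead of raw distances); `knn_multi_partition` is a hypothetical name.

```python
import heapq
import math

def eu_dist(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_multi_partition(home_records, sibling_partitions, query_ts, k):
    """Refine a home-partition top-k using records from sibling partitions."""
    top = heapq.nsmallest(
        k, ((eu_dist(ts, query_ts), rid) for _s, ts, rid in home_records))
    # K-th distance as threshold; no pruning if fewer than k results so far.
    threshold = top[-1][0] if top and len(top) == k else math.inf
    candidates = list(top)
    for records in sibling_partitions:
        for _s, ts, rid in records:
            d = eu_dist(ts, query_ts)
            if d <= threshold:                 # prune by the home threshold
                candidates.append((d, rid))
    return heapq.nsmallest(k, candidates)

home = [("a", [1.0, 0.0], 1), ("a", [2.0, 0.0], 2)]
siblings = [[("b", [0.5, 0.0], 3), ("b", [5.0, 0.0], 4)]]
result = knn_multi_partition(home, siblings, [0.0, 0.0], k=2)
```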
  • 24. Experimental Setup
    Datasets (normalized; each point is stored as a float):
    – Random Walk: 1 billion series, length 256
    – Texmex [1]: 1 billion series, length 128
    – DNA [2]: 200 million series, length 192
    – Noaa Climate [3]: 200 million series, length 64
    Sources: [1] http://corpus-texmex.irisa.fr/ [2] https://genmone.ucsc.edu [3] https://www.ncdc.gov/
    HW & SW configuration:
    – Spark 2.0.2, standalone mode; Hadoop 2.7.3
    – Platform: Ubuntu 16.04 LTS
    – HW: 2 nodes, each consisting of 56 Xeon E5 processors, 500G RAM, 7TB SATA hard drive
    State of the art (baseline): Yagoubi, Djamel-Edine, et al. "DPiSAX: Massively Distributed Partitioned iSAX." ICDM 2017.
    Parameters (Baseline / Spark-ITS):
    – Initial cardinality: 512 / 64
    – Word length: 8 / 8
    – Sampling percent: 10% / 10%
    – Leaf node split threshold of local index: 1000 / 1000
    The initial cardinality of the baseline system is its default value; the baseline needs a large initial value to guarantee enough bit levels for binary splits.
  • 25. Index Construction Time. [Figure: bar chart of index construction time (minutes) against #time series (200m, 400m, 600m, 800m, 1b) for Spark-ITS and the Baseline (state of the art), split into Global Index and Local Index. Dataset: Random Walk Benchmark. Labeled values: 2,323 and 334; annotation: 80+%.]
  • 26. Index Construction Time: Breakdown. Dataset: Random Walk Benchmark. [Figure, two panels comparing Spark-ITS and the Baseline (state of the art) at 200m, 400m, 600m, 800m and 1b time series, in minutes. Global Index Time Breakdown: Sampling, Statistic, Build Index, Assign Pid (axis 0 to 40). Repartition and Local Index Time Breakdown: Read and conversion, Shuffle and Build Local Index (axis 0 to 2,000).]
  • 28. kNN-Approximate Query Performance. [Figure, two panels comparing Spark-ITS with the state of the art on RandomWalk (400m), Texmex (400m), DNA (200m) and Noaa (200m) time series. Recall (axis 0% to 60%), annotated ✕30, ✕35, ✕28, ✕72. Error Ratio (axis 1.0 to 2.2), annotated 27%, 26%, 48%, 34%.]
  • 29. Conclusion • Index Tree – Large fan-out decreases the depth of leaf nodes – Keeps better similarity at word level – The signature simplifies cardinality conversion • Spark-ITS: Index Construction – Block sampling and node statistic collection build the global index quickly – Local indices are built synchronously within each partition – Index construction is 80+% faster • Spark-ITS: Query – Exact matching: query time decreases by 50% – kNN approximate: accuracy increases more than 10-fold
  • 30. Acknowledgments. Funding from: Xianjin Tech Co., Ltd.; Saudi Arabian Cultural Mission; WPI Computer Science Dept.; NSF CNS: 305258 II-EN; NSF CRI: 0551584