CTF: Anomaly Detection in High-Dimensional
Time Series with Coarse-to-Fine Model Transfer
Ming Sun, Ya Su, Shenglin Zhang, Yuanpu Cao, Yuqing Liu, Dan Pei,
Wenfei Wu, Yongsu Zhang, Xiaozhou Liu, Junliang Tang
INFOCOM 2021
Outline
Background Design Evaluation Conclusion
2
DL Algorithms in Infrastructure Operation
4
• Advantages
– Automation
– Robustness
– Saving operators' labor
• Example:
– RNN-VAE for anomaly detection
RNN-VAE Based Algorithms
5
Network architecture of the RNN-VAE model at time t:
𝒙𝒕 (49) -> 𝒛𝒕 (3) -> 𝒙𝒕′ (49)
[Figure: Variational Auto-Encoder (VAE) — an RNN plus dense layers encode 𝒙𝒕 into 𝒛𝒕; dense layers plus an RNN decode 𝒛𝒕 into the reconstruction 𝒙𝒕′]
KPI dimension reduced (49 -> 3)
Network Layers (see the sketch below)
• RNN: Shallow & general
• Dense layers: Deep & specific
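A minimal PyTorch sketch of this architecture. The dimensions (49 KPIs, 3 latent dims) come from the slide; everything else — GRU cells, hidden width, loss form — is an illustrative assumption, not the authors' exact model:

```python
import torch
import torch.nn as nn

class RnnVae(nn.Module):
    """Sketch of an RNN-VAE: x_t (49 KPIs) -> z_t (3 dims) -> x_t' (49 KPIs).
    The RNN is shallow and general; the dense layers are deep and specific."""
    def __init__(self, n_kpi=49, z_dim=3, hidden=100):
        super().__init__()
        self.enc_rnn = nn.GRU(n_kpi, hidden, batch_first=True)
        self.enc_mu = nn.Linear(hidden, z_dim)      # dense layers: posterior mean
        self.enc_logvar = nn.Linear(hidden, z_dim)  # dense layers: posterior log-variance
        self.dec_dense = nn.Linear(z_dim, hidden)   # dense layers of the decoder
        self.dec_rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.dec_out = nn.Linear(hidden, n_kpi)

    def forward(self, x):                           # x: (batch, time, 49)
        h, _ = self.enc_rnn(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        d, _ = self.dec_rnn(torch.relu(self.dec_dense(z)))
        return self.dec_out(d), mu, logvar

def vae_loss(x, x_rec, mu, logvar):
    """Reconstruction error plus KL divergence to the standard normal prior."""
    rec = ((x - x_rec) ** 2).sum()
    kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum()
    return rec + kl
```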
Scalability is the problem at large scale
6-7
• High-Dimensional Data
– Machines: in the millions
– KPIs: in the tens
– Time: frequent data queries (2880 samples/day)
• One model per machine: prohibitive training time
(tens of minutes per model × millions of machines)
• One model for all machines: poor accuracy
Goal: devise scalable deep learning (DL) algorithms for
large-scale anomaly detection
Intuition and Challenges
8-10
• Intuition: cluster machines first, then run DL for each cluster
(coarse-grained model -> clustering -> fine-grained models)
• Challenge 1: circular dependency between clustering and model training
• Clustering cannot run on the high-dimensional raw data
• DL cannot run on the whole dataset without clustering
• Solution: the synthetic framework
• Challenge 2: high dimensionality of the time domain
• Hard to cluster even when the KPI dimension is compressed
• Solution: compress the 𝒛𝒕 sequence to its distribution
• Challenge 3: neural network training method
• Solution: fine-tuning strategy
• Freeze the RNN and tune the dense layers
[Figure: RNN-VAE with the RNN frozen and the dense layers tuned]
Outline
Background Design Evaluation Conclusion
11
Framework of model training
12
[Figure: framework of model training — pre-training a coarse-grained model on data sampled from the M machine entities; per-machine feature extraction; machine clustering into K clusters (K << M); model transfer and fine-tuning into one fine-grained model per cluster]
Framework of model training
13
[Figure: framework of model training, as above]
• Sampling strategy (for the pre-training data; see the sketch below):
• Machine sampling
• Time sampling
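A sketch of the two sampling axes for building the pre-training set, assuming the raw data is one (machines × time × KPIs) array; the fractions and window length are illustrative assumptions, not values from the paper:

```python
import numpy as np

def sample_for_pretraining(data, machine_frac=0.1, time_frac=0.1, seed=0):
    """data: array of shape (M machines, T time points, K KPIs).
    Machine sampling picks a subset of entities; time sampling picks
    contiguous windows so the RNN still sees valid sequences."""
    rng = np.random.default_rng(seed)
    m, t, _ = data.shape
    machines = rng.choice(m, size=max(1, int(m * machine_frac)), replace=False)
    win = 288                      # e.g. a few hours of 30s samples (assumption)
    n_win = max(1, int(t * time_frac) // win)
    starts = rng.choice(t - win, size=n_win, replace=False)
    return np.stack([data[i, s:s + win] for i in machines for s in starts])
```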
Framework of model training
14
[Figure: framework of model training, as above]
• Feature extraction:
𝒙𝒕 sequence -> 𝒛𝒕 sequence -> 𝒛𝒕 distribution (see the sketch below)
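One way to realize this compression, assuming the 𝒛𝒕 sequence is summarized as a diagonal Gaussian (an assumption; the slide only states that the sequence is compressed into its distribution):

```python
import numpy as np

def z_distribution(z_seq):
    """z_seq: (T, 3) array — the z_t sequence one machine gets from the
    coarse pre-trained model. Collapse the time axis into a diagonal
    Gaussian summary (mean, std), so every machine becomes a fixed-size
    feature regardless of sequence length T."""
    return z_seq.mean(axis=0), z_seq.std(axis=0)
```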
Framework of model training
15
[Figure: framework of model training, as above]
• Clustering (see the sketch below):
𝒛𝒕 distribution -> distance matrix -> clustering results
• Distance measure: Wasserstein distance
• Clustering algorithm: HAC
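A sketch of this step under the diagonal-Gaussian summary above. For such Gaussians the 2-Wasserstein distance has the closed form W2² = ‖μ1 − μ2‖² + ‖σ1 − σ2‖², and SciPy's hierarchical agglomerative clustering (HAC) consumes the condensed distance matrix; the average linkage and the cluster count are illustrative choices:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def w2_gaussian(mu1, sig1, mu2, sig2):
    """Closed-form 2-Wasserstein distance between diagonal Gaussians."""
    return np.sqrt(np.sum((mu1 - mu2) ** 2) + np.sum((sig1 - sig2) ** 2))

def cluster_machines(features, n_clusters):
    """features: list of (mu, sigma) pairs, one per machine entity."""
    m = len(features)
    dist = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            dist[i, j] = dist[j, i] = w2_gaussian(*features[i], *features[j])
    # HAC on the condensed distance matrix; average linkage is an assumption
    tree = linkage(squareform(dist), method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")
```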
Framework of model training
16
[Figure: framework of model training, as above; the RNN-VAE's RNN layers are reused, its dense layers are retrained per cluster]
• Fine-tuning strategy (see the sketch below):
• RNN: fixed
• Dense layers: tuned
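A sketch of the fine-tuning strategy applied to the RnnVae sketch above: freeze the RNN parameters (shallow and general) and hand only the dense-layer parameters (deep and specific) to the optimizer. The learning rate and optimizer choice are assumptions:

```python
import torch

def make_finetune_optimizer(model, lr=1e-3):
    """Freeze the RNNs; tune only the dense layers of each cluster's model."""
    for name, p in model.named_parameters():
        p.requires_grad = "rnn" not in name   # enc_rnn / dec_rnn stay fixed
    tunable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(tunable, lr=lr)

# per cluster: copy the coarse model, then fine-tune on that cluster's data
#   import copy
#   cluster_model = copy.deepcopy(coarse_model)
#   opt = make_finetune_optimizer(cluster_model)
```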
System architecture
17
[Figure: system architecture — monitored machine entities feed a Data API; offline data goes through Data Preprocessing (§IV-A) into Offline Model Training (§IV-B); online data goes into Online Anomaly Detection (§IV-C); model scores feed Outlier Alerting (§V-D) and Results & Visualization]
1. Data preprocessing
2. Offline model training
3. Online anomaly detection (see the sketch below)
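How the online stage might score incoming windows with a cluster's fine-tuned model. Using reconstruction error as the anomaly score and a fixed threshold are assumptions here (a common choice for VAE-based detectors), not details from the slides:

```python
import torch

@torch.no_grad()
def anomaly_scores(model, x):
    """x: (batch, time, 49) online windows for machines in one cluster.
    Higher reconstruction error at a time step -> more anomalous."""
    model.eval()
    x_rec, _, _ = model(x)
    return ((x - x_rec) ** 2).mean(dim=-1)   # (batch, time) scores

# alert = anomaly_scores(cluster_model, window) > threshold  # threshold from the alerting step (§V-D)
```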
Labeling tools
18
The interface of the labeling tool
Outline
Background Design Evaluation Conclusion
19
Dataset & performance metrics
20
• Dataset:
– # Machine entities: 533
– Dimension of each machine entity: 49 KPIs × 37440 time
points (30s sampling interval, 13 days)
– Training = first 5 days, Testing = last 8 days
• Metrics (see the sketch below):
– F1, Precision, Recall: averaged over all machine entities
– Model training time
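A sketch of this evaluation protocol, assuming binary point-wise anomaly labels; scikit-learn's precision_recall_fscore_support computes the per-machine scores that are then averaged:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

SAMPLES_PER_DAY = 2880                 # 30s sampling interval
TRAIN_T = 5 * SAMPLES_PER_DAY          # first 5 days train; last 8 days test

def averaged_metrics(per_machine):
    """per_machine: list of (y_true, y_pred) binary label arrays,
    one pair per machine entity, covering the 8 test days."""
    scores = [precision_recall_fscore_support(y_t, y_p, average="binary",
                                              zero_division=0)[:3]
              for y_t, y_p in per_machine]
    return tuple(np.mean(scores, axis=0))   # (precision, recall, F1)
```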
Overall performance
21-24
• Scalability
– Pre-training: fixed cost (5493s)
– Feature extraction: 0.3s / machine
– Clustering: much smaller
– Fine-tuning: 448s / model
• Effectiveness
– F1: 0.830 -> 0.892 (without -> with alerting)
[Table: execution time of each step under different numbers of machine entities]
[Table: F1, Precision, and Recall scores of CTF without and with alerting]
Overall performance
25-28
• Validating the Synthetic Framework
– One model/machine
– One model for all
– CTF w/o transfer
[Table: comparison with model variations]
[Figure: F1 and training time under different numbers of epochs for CTF w/o transfer]
Validating Design Choices
29
• Choice of clustering objects
– SPF, ROCKA, DCN
• Choice of distance measures
– KL divergence, JS divergence, mean squared error
• Choice of clustering algorithms
– DBSCAN, K-medoids
Outline
Background Design Evaluation Conclusion
30
Conclusion
31
• CTF: a synthetic framework for high-dimensional time series
(machine, KPI, time)
• Techniques: 𝒛𝒕-distribution clustering, model reuse, fine-tuning
• Evaluation: CTF's scalability and effectiveness
• Labeling tool + labeled dataset
Thank you!
Q & A
sunm19@mails.tsinghua.edu.cn
INFOCOM 2021
32