CTF: Anomaly Detection in High-Dimensional
Time Series with Coarse-to-Fine Model Transfer
Ming Sun, Ya Su, Shenglin Zhang, Yuanpu Cao, Yuqing Liu, Dan Pei,
Wenfei Wu, Yongsu Zhang, Xiaozhou Liu, Junliang Tang
INFOCOM 2021
Outline
Background Design Evaluation Conclusion
2
DL Algorithms in Infrastructure Operation
4
• Advantages
– Automation
– Robustness
– Saving operators' labor
• Example:
– RNN-VAE for anomaly detection
RNN-VAE Based Algorithms
5
Network architecture of the RNN-VAE model at time t:
𝒙𝒕 (49) -> 𝒛𝒕 (3) -> 𝒙𝒕′ (49)
[Figure: Variational Auto-Encoder (VAE) — an RNN plus dense layers encode 𝒙𝒕 into 𝒛𝒕; dense layers plus an RNN decode 𝒛𝒕 into the reconstruction 𝒙𝒕′]
KPI dimension reduced (49 -> 3)
Network Layers (see the sketch below)
• RNN: Shallow & general
• Dense layers: Deep & specific
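A minimal PyTorch sketch of this architecture. The dimensions (49 KPIs, 3 latent dims) come from the slide; everything else — GRU cells, hidden width, loss form — is an illustrative assumption, not the authors' exact model:

```python
import torch
import torch.nn as nn

class RnnVae(nn.Module):
    """Sketch of an RNN-VAE: x_t (49 KPIs) -> z_t (3 dims) -> x_t' (49 KPIs).
    The RNN is shallow and general; the dense layers are deep and specific."""
    def __init__(self, n_kpi=49, z_dim=3, hidden=100):
        super().__init__()
        self.enc_rnn = nn.GRU(n_kpi, hidden, batch_first=True)
        self.enc_mu = nn.Linear(hidden, z_dim)      # dense layers: posterior mean
        self.enc_logvar = nn.Linear(hidden, z_dim)  # dense layers: posterior log-variance
        self.dec_dense = nn.Linear(z_dim, hidden)   # dense layers of the decoder
        self.dec_rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.dec_out = nn.Linear(hidden, n_kpi)

    def forward(self, x):                           # x: (batch, time, 49)
        h, _ = self.enc_rnn(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        d, _ = self.dec_rnn(torch.relu(self.dec_dense(z)))
        return self.dec_out(d), mu, logvar

def vae_loss(x, x_rec, mu, logvar):
    """Reconstruction error plus KL divergence to the standard normal prior."""
    rec = ((x - x_rec) ** 2).sum()
    kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum()
    return rec + kl
```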
Scalability is the problem at large scale
6-7
• High-Dimensional Data
– Machines: in the millions
– KPIs: in the tens
– Time: frequent data queries (2880 samples/day)
• One model per machine: prohibitive training time
(tens of minutes per model × millions of machines)
• One model for all machines: poor accuracy
Goal: devise scalable deep learning (DL) algorithms for
large-scale anomaly detection
Intuition and Challenges
8-10
• Intuition: cluster machines first, then run DL for each cluster
(coarse-grained model -> clustering -> fine-grained models)
• Challenge 1: circular dependency between clustering and model training
• Clustering cannot run on the high-dimensional raw data
• DL cannot run on the whole dataset without clustering
• Solution: the synthetic framework
• Challenge 2: high dimensionality of the time domain
• Hard to cluster even when the KPI dimension is compressed
• Solution: compress the 𝒛𝒕 sequence to its distribution
• Challenge 3: neural network training method
• Solution: fine-tuning strategy
• Freeze the RNN and tune the dense layers
[Figure: RNN-VAE with the RNN frozen and the dense layers tuned]
Outline
Background Design Evaluation Conclusion
11
Framework of model training
12
[Figure: framework of model training — pre-training a coarse-grained model on data sampled from the M machine entities; per-machine feature extraction; machine clustering into K clusters (K << M); model transfer and fine-tuning into one fine-grained model per cluster]
Framework of model training
13
[Figure: framework of model training, as above]
• Sampling strategy (for the pre-training data; see the sketch below):
• Machine sampling
• Time sampling
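A sketch of the two sampling axes for building the pre-training set, assuming the raw data is one (machines × time × KPIs) array; the fractions and window length are illustrative assumptions, not values from the paper:

```python
import numpy as np

def sample_for_pretraining(data, machine_frac=0.1, time_frac=0.1, seed=0):
    """data: array of shape (M machines, T time points, K KPIs).
    Machine sampling picks a subset of entities; time sampling picks
    contiguous windows so the RNN still sees valid sequences."""
    rng = np.random.default_rng(seed)
    m, t, _ = data.shape
    machines = rng.choice(m, size=max(1, int(m * machine_frac)), replace=False)
    win = 288                      # e.g. a few hours of 30s samples (assumption)
    n_win = max(1, int(t * time_frac) // win)
    starts = rng.choice(t - win, size=n_win, replace=False)
    return np.stack([data[i, s:s + win] for i in machines for s in starts])
```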
Framework of model training
14
[Figure: framework of model training, as above]
• Feature extraction:
𝒙𝒕 sequence -> 𝒛𝒕 sequence -> 𝒛𝒕 distribution (see the sketch below)
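One way to realize this compression, assuming the 𝒛𝒕 sequence is summarized as a diagonal Gaussian (an assumption; the slide only states that the sequence is compressed into its distribution):

```python
import numpy as np

def z_distribution(z_seq):
    """z_seq: (T, 3) array — the z_t sequence one machine gets from the
    coarse pre-trained model. Collapse the time axis into a diagonal
    Gaussian summary (mean, std), so every machine becomes a fixed-size
    feature regardless of sequence length T."""
    return z_seq.mean(axis=0), z_seq.std(axis=0)
```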
Framework of model training
15
[Figure: framework of model training, as above]
• Clustering (see the sketch below):
𝒛𝒕 distribution -> distance matrix -> clustering results
• Distance measure: Wasserstein distance
• Clustering algorithm: HAC
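A sketch of this step under the diagonal-Gaussian summary above. For such Gaussians the 2-Wasserstein distance has the closed form W2² = ‖μ1 − μ2‖² + ‖σ1 − σ2‖², and SciPy's hierarchical agglomerative clustering (HAC) consumes the condensed distance matrix; the average linkage and the cluster count are illustrative choices:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def w2_gaussian(mu1, sig1, mu2, sig2):
    """Closed-form 2-Wasserstein distance between diagonal Gaussians."""
    return np.sqrt(np.sum((mu1 - mu2) ** 2) + np.sum((sig1 - sig2) ** 2))

def cluster_machines(features, n_clusters):
    """features: list of (mu, sigma) pairs, one per machine entity."""
    m = len(features)
    dist = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            dist[i, j] = dist[j, i] = w2_gaussian(*features[i], *features[j])
    # HAC on the condensed distance matrix; average linkage is an assumption
    tree = linkage(squareform(dist), method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")
```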
Framework of model training
16
[Figure: framework of model training, as above; the RNN-VAE's RNN layers are reused, its dense layers are retrained per cluster]
• Fine-tuning strategy (see the sketch below):
• RNN: fixed
• Dense layers: tuned
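A sketch of the fine-tuning strategy applied to the RnnVae sketch above: freeze the RNN parameters (shallow and general) and hand only the dense-layer parameters (deep and specific) to the optimizer. The learning rate and optimizer choice are assumptions:

```python
import torch

def make_finetune_optimizer(model, lr=1e-3):
    """Freeze the RNNs; tune only the dense layers of each cluster's model."""
    for name, p in model.named_parameters():
        p.requires_grad = "rnn" not in name   # enc_rnn / dec_rnn stay fixed
    tunable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(tunable, lr=lr)

# per cluster: copy the coarse model, then fine-tune on that cluster's data
#   import copy
#   cluster_model = copy.deepcopy(coarse_model)
#   opt = make_finetune_optimizer(cluster_model)
```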
System architecture
17
[Figure: system architecture — monitored machine entities feed a Data API; offline data goes through Data Preprocessing (§IV-A) into Offline Model Training (§IV-B); online data goes into Online Anomaly Detection (§IV-C); model scores feed Outlier Alerting (§V-D) and Results & Visualization]
1. Data preprocessing
2. Offline model training
3. Online anomaly detection (see the sketch below)
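How the online stage might score incoming windows with a cluster's fine-tuned model. Using reconstruction error as the anomaly score and a fixed threshold are assumptions here (a common choice for VAE-based detectors), not details from the slides:

```python
import torch

@torch.no_grad()
def anomaly_scores(model, x):
    """x: (batch, time, 49) online windows for machines in one cluster.
    Higher reconstruction error at a time step -> more anomalous."""
    model.eval()
    x_rec, _, _ = model(x)
    return ((x - x_rec) ** 2).mean(dim=-1)   # (batch, time) scores

# alert = anomaly_scores(cluster_model, window) > threshold  # threshold from the alerting step (§V-D)
```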
Labeling tools
18
The interface of the labeling tool
Outline
Background Design Evaluation Conclusion
19
Dataset & performance metrics
20
• Dataset:
– # Machine entities: 533
– Dimension of each machine entity: 49 KPIs × 37440 time
points (30s sampling interval, 13 days)
– Training = first 5 days, Testing = last 8 days
• Metrics (see the sketch below):
– F1, Precision, Recall: averaged over all machine entities
– Model training time
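A sketch of this evaluation protocol, assuming binary point-wise anomaly labels; scikit-learn's precision_recall_fscore_support computes the per-machine scores that are then averaged:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

SAMPLES_PER_DAY = 2880                 # 30s sampling interval
TRAIN_T = 5 * SAMPLES_PER_DAY          # first 5 days train; last 8 days test

def averaged_metrics(per_machine):
    """per_machine: list of (y_true, y_pred) binary label arrays,
    one pair per machine entity, covering the 8 test days."""
    scores = [precision_recall_fscore_support(y_t, y_p, average="binary",
                                              zero_division=0)[:3]
              for y_t, y_p in per_machine]
    return tuple(np.mean(scores, axis=0))   # (precision, recall, F1)
```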
Overall performance
21-24
• Scalability
– Pre-training: fixed cost (5493s)
– Feature extraction: 0.3s / machine
– Clustering: much smaller
– Fine-tuning: 448s / model
• Effectiveness
– F1: 0.830 -> 0.892 (without -> with alerting)
[Table: execution time of each step under different numbers of machine entities]
[Table: F1, Precision, and Recall scores of CTF without and with alerting]
Overall performance
25-28
• Validating the Synthetic Framework
– One model/machine
– One model for all
– CTF w/o transfer
[Table: comparison with model variations]
[Figure: F1 and training time under different numbers of epochs for CTF w/o transfer]
Validating Design Choices
29
• Choice of clustering objects
– SPF, ROCKA, DCN
• Choice of distance measures
– KL divergence, JS divergence, mean squared error
• Choice of clustering algorithms
– DBSCAN, K-medoids
Outline
Background Design Evaluation Conclusion
30
Conclusion
31
• CTF: a synthetic framework for high-dimensional time series
(machine, KPI, time)
• Techniques: 𝒛𝒕-distribution clustering, model reuse, fine-tuning
• Evaluation: CTF's scalability and effectiveness
• Labeling tool + labeled dataset
Thank you!
Q & A
sunm19@mails.tsinghua.edu.cn
INFOCOM 2021
32