Topological Data Analysis (TDA)
and Use Cases
Kim Hee (kimheekimi@gmail.com)
Outline
1. Visualization by TDA
2. Insights Discovery & Feature Selection
3. Evaluate the Insights
22.01.2016 Kim Hee, “Topological Data Analysis and Use Cases” 2
Visualization
[Figure: pipeline from raw data to network. Labels: Raw Data, Filter / Filter & Metric, Division with Redundancy, Nodes, Network (Node A, Edge, Node B)]
Point Cloud
      e    f    g    h
A     3    7   10   12
B     4    8   11   13
C     5    9    8   10
D    13   11    8    4
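The stages named in the figure (filter, division with redundancy, nodes, network) resemble the Mapper construction listed in the References. The following is only a minimal Mapper-style sketch on the slide's point cloud, not the Ayasdi implementation. The choice of attribute f as the filter, two intervals, the overlap fraction, the distance threshold `eps`, and the simple single-linkage clustering are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist

# Point cloud from the slide: four samples with attributes e, f, g, h.
points = {"A": (3, 7, 10, 12), "B": (4, 8, 11, 13),
          "C": (5, 9, 8, 10), "D": (13, 11, 8, 4)}

def mapper(points, filter_fn, n_intervals=2, overlap=0.5, eps=5.0):
    """Tiny Mapper-style pipeline: filter -> overlapping bins -> clusters -> graph."""
    values = {name: filter_fn(p) for name, p in points.items()}
    lo, hi = min(values.values()), max(values.values())
    width = (hi - lo) / n_intervals
    pad = width * overlap / 2  # each bin is widened by `pad` on both sides

    nodes = []
    for k in range(n_intervals):
        start, end = lo + k * width - pad, lo + (k + 1) * width + pad
        members = [n for n, v in values.items() if start <= v <= end]
        # Single-linkage clustering inside the bin (assumed, for simplicity):
        # a sample joins every existing cluster that has a member within eps.
        clusters = []
        for name in members:
            linked = [c for c in clusters
                      if any(dist(points[name], points[m]) <= eps for m in c)]
            merged = {name}.union(*linked)
            clusters = [c for c in clusters if c not in linked] + [merged]
        nodes.extend(clusters)

    # Two nodes are joined by an edge when they share at least one sample.
    edges = [(i, j) for i, j in combinations(range(len(nodes)), 2)
             if nodes[i] & nodes[j]]
    return nodes, edges

# Filter on attribute f (index 1); this choice is hypothetical.
nodes, edges = mapper(points, filter_fn=lambda p: p[1])
print(nodes, edges)
```

With these assumed settings, sample C falls into both overlapping bins, which is what produces an edge between two nodes.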
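• $L_2(A,B) = \sqrt{|3-4|^2 + |7-8|^2 + |10-11|^2 + |12-13|^2} = \sqrt{4} = 2$
• $L_2(A,C) = \sqrt{|3-5|^2 + |7-9|^2 + |10-8|^2 + |12-10|^2} = \sqrt{16} = 4$
• $L_2(B,D) = \sqrt{|4-13|^2 + |8-11|^2 + |11-8|^2 + |13-4|^2} = \sqrt{180} \approx 13.416$
• $\cos(\angle AOB) = \frac{(3 \times 4) + (7 \times 8) + (10 \times 11) + (12 \times 13)}{\sqrt{3^2 + 7^2 + 10^2 + 12^2} \times \sqrt{4^2 + 8^2 + 11^2 + 13^2}} = \frac{334}{334.275} = 0.999$
• $\cos(\angle AOC) = \frac{(3 \times 5) + (7 \times 9) + (10 \times 8) + (12 \times 10)}{\sqrt{3^2 + 7^2 + 10^2 + 12^2} \times \sqrt{5^2 + 9^2 + 8^2 + 10^2}} = \frac{278}{285.552} = 0.974$
• $\cos(\angle BOD) = \frac{(4 \times 13) + (8 \times 11) + (11 \times 8) + (13 \times 4)}{\sqrt{4^2 + 8^2 + 11^2 + 13^2} \times \sqrt{13^2 + 11^2 + 8^2 + 4^2}} = \frac{280}{370} = 0.757$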
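Euclidean Distance: $L_2(X,Y) = \sqrt{\sum_{i=1}^{N} |X_i - Y_i|^2}$
Cosine Similarity: $\cos\theta = \frac{\sum_{i=1}^{N} X_i \times Y_i}{\sqrt{\sum_{i=1}^{N} X_i^2} \times \sqrt{\sum_{i=1}^{N} Y_i^2}}$
X, Y: data samples; X_i, Y_i: the i-th attribute of each sample; N: number of attributes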
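The two metrics can be checked against the point cloud with a short script. This is a sketch using only the Python standard library; the helper name `cosine_similarity` is an illustrative choice, and the loop covers the pairs (A, B), (A, C), (A, D) and (B, D) so that each worked example can be compared with the computed value.

```python
from math import dist, sqrt

# Point cloud from the slide: four samples with attributes e, f, g, h.
points = {"A": (3, 7, 10, 12), "B": (4, 8, 11, 13),
          "C": (5, 9, 8, 10), "D": (13, 11, 8, 4)}

def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

for first, second in [("A", "B"), ("A", "C"), ("A", "D"), ("B", "D")]:
    x, y = points[first], points[second]
    # math.dist returns the Euclidean (L2) distance between two points.
    print(first, second, round(dist(x, y), 3), round(cosine_similarity(x, y), 3))
```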
Insights Discovery
Case 1 – Titanic
Insights Discovery
Case 2 – Energy Consumption
• Problem Domain
» Detect features that correlate with energy consumption
• Data Description
» Historical energy consumption data from the U.K., provided by a power plant
» 1,096 rows × 8 attributes
» The label attribute is volume; the others are weather and calendar events
• Apply TDA → Discovered insight: volume is correlated with day_type and school_holiday
Insights Discovery
Case 3 – High Dimensional Data
• Problem Domain
» Detect features that can predict which customers may terminate their service
• Data Description
» Customer data provided by Orange telecom
» 50,000 rows × 233 attributes
» The label attribute is churn (binary)
» The other attributes are anonymized
• Apply TDA
The result of group comparison:
Column Name    Value         Hypergeometric p-value
churn          1             1.00E-12
Var202         PXLV          3.78E-04
Var199         Gai9lEF2Fr    4.19E-04
Var198         Z4hPoJV       4.82E-04
Var222         xiJRusu       4.82E-04
⋮              ⋮             ⋮
Var220         Af96s0w       0.047965
Var220         rDm3DH0       0.047965
Var197         yMvB          0.049324
49 underlying features are captured (p-value smaller than 0.05)
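The slide does not say how the hypergeometric p-values were computed. One common form, sketched here as an assumption, is the upper-tail probability of seeing at least the observed number of samples with a given value inside a selected group, when the group is drawn without replacement from the whole data set. The function name, parameter names, and the counts in the usage lines are hypothetical and are not taken from the Orange data.

```python
from math import comb

def hypergeom_p_value(population, successes, draws, observed):
    """Upper-tail hypergeometric probability P(X >= observed).

    population: total number of rows
    successes:  rows in the population that carry the value of interest
    draws:      rows in the selected group
    observed:   rows in the group that carry the value
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    favourable = sum(comb(successes, k) * comb(population - successes, draws - k)
                     for k in range(observed, upper + 1))
    return favourable / total

# Illustrative counts only: 50 of 1,000 rows carry the value, and
# 12 of them appear in a selected group of 100 rows.
p = hypergeom_p_value(population=1000, successes=50, draws=100, observed=12)
print(p, p < 0.05)  # the slide keeps column/value pairs with p-value below 0.05
```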
Time to evaluate the insights…
Evaluation Framework
Data: 10 labeled samples, split into training data (70%) and test data (30%)
             label
sample 1     Y
sample 2     N
sample 3     Y
sample 4     Y
sample 5     N
sample 6     Y
sample 7     Y
sample 8     Y
sample 9     N
sample 10    Y
Pipeline: Feature Selection (PCA, MRMR, TDA) → Modeling (Decision Tree: Model 1, Model 2, Model 3, Model 4) → Evaluation (prediction 1, prediction 2, prediction 3, prediction 4)
Predictions on the test data (label / result):
prediction 1    prediction 2    prediction 3    prediction 4
Y / Y           Y / N           Y / N           Y / Y
N / Y           N / Y           N / N           N / N
Y / Y           Y / N           Y / N           Y / Y
Sample Comparison Result
Method    Selected Features    Reduction    Accuracy
-         all                  -            66%
PCA       7 features           22.22%       0%
RF        4 features           55.56%       33%
TDA       2 features           77.78%       100%
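The accuracy and reduction columns of the sample comparison can be reproduced from the label/result pairs. This sketch assumes the pairing of result tables to predictions that matches the accuracies in the comparison table, and it assumes a total of 9 features, a number inferred from the percentages (7 selected features corresponding to 22.22%) rather than stated on the slide.

```python
# Label/result pairs for the three test samples, read from the slide.
predictions = {
    "prediction 1": [("Y", "Y"), ("N", "Y"), ("Y", "Y")],
    "prediction 2": [("Y", "N"), ("N", "Y"), ("Y", "N")],
    "prediction 3": [("Y", "N"), ("N", "N"), ("Y", "N")],
    "prediction 4": [("Y", "Y"), ("N", "N"), ("Y", "Y")],
}

def accuracy(pairs):
    """Fraction of test samples whose predicted result equals the label."""
    return sum(label == result for label, result in pairs) / len(pairs)

def reduction_rate(n_selected, n_total):
    """Percentage of features removed by a feature selection method."""
    return 100 * (n_total - n_selected) / n_total

TOTAL_FEATURES = 9  # assumption: inferred from the slide's percentages

for name, pairs in predictions.items():
    print(name, f"{accuracy(pairs):.0%}")
for n_selected in (7, 4, 2):
    print(n_selected, f"{reduction_rate(n_selected, TOTAL_FEATURES):.2f} %")
```

The slide appears to truncate 2/3 and 1/3 to 66% and 33%, while the formatting above rounds them.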
Evaluation
• Energy Consumption
Dimensional reduction               All          PCA                       MRMR         TDA
Reduction rate                      0 %          66.67 %                   88.89 %      77.78 %
Selected features                   all          winter, solar_rad, temp   day_type     day_type, sch_holiday
MAPE, model by Neural Network       3.0546 %     11.1026 %                 5.7003 %     3.6406 %
MAPE, model by SVM                  10.9843 %    11.0649 %                 10.6166 %    10.7778 %
• High Dimensional Data
Dimensional reduction                        All        PCA            MRMR                    TDA
Reduction rate (no. of selected features)    0 % (0)    92.70 % (17)   57.08 % (100/default)   83.26 % (39)
F1 Score, model by Naïve Bayes               0.147      0.005          0.146                   0.147
F1 Score, model by Decision tree             0.016      0.002          0.023                   0.036
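For reference, the two evaluation measures used in these tables can be written out directly. This is a sketch of the standard definitions of MAPE and F1 score (both linked in the References), not the code used to produce the reported numbers. The function names and the small inputs in the usage lines are made up for illustration.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def f1_score(labels, results, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(labels, results))
    tp = sum(l == positive and r == positive for l, r in pairs)
    fp = sum(l != positive and r == positive for l, r in pairs)
    fn = sum(l == positive and r != positive for l, r in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative inputs only, not data from the slides.
print(mape([100, 200], [110, 180]))
print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

Lower MAPE is better for the energy consumption regression, while higher F1 is better for the churn classification.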
References
• Used tool: Ayasdi, http://www.ayasdi.com/
• Open source: Mapper, http://danifold.net/mapper/
• PCA: https://en.wikipedia.org/wiki/Principal_component_analysis
• SVM: https://en.wikipedia.org/wiki/Support_vector_machine
• MRMR: http://penglab.janelia.org/proj/mRMR/
• MAPE: https://en.wikipedia.org/wiki/Mean_absolute_percentage_error
• F1 Score: https://en.wikipedia.org/wiki/F1_score
Questions?
Kim Hee (kimheekimi@gmail.com)
