2021 06-02-tabnet
1. TabNet inputs raw tabular data without any preprocessing and is trained using
gradient descent-based optimization.
2. TabNet uses sequential attention to choose which features to reason from at each
decision step.
-> This yields interpretability and better learning, as the learning capacity is used for the most salient features.
3. TabNet outperforms or is on par with other tabular learning models on various
datasets.
4. We show performance improvements by using unsupervised pre-training to
predict masked features.
Contribution
Feature selection broadly refers to judiciously picking a subset of
features based on their usefulness for prediction.
Feature selection & Tree-based learning
The prominent strength of tree-based models is efficient picking of global features with the most statistical information gain.
To improve the performance of standard DTs (decision trees), one common approach is ensembling to reduce variance (a small example follows this slide).
Random Forest : grows many trees on random subsets of the data with randomly selected features.
XGBoost & LightGBM : gradient-boosted tree ensembles. (to be covered another time)
Integration of DNNs into DTs ->
Self-supervised learning -> BERT
Feature selection & Tree-based learning
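For reference, a tree-ensemble baseline of the kind described above can be fit in a few lines with scikit-learn; this is only an illustrative sketch, and the synthetic dataset and hyperparameters are placeholders rather than anything used in the paper.

```python
# Minimal tree-ensemble baseline (illustrative only; not the paper's setup).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic tabular data in which only a few columns are informative.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Random Forest: many trees grown on bootstrapped rows and random feature subsets.
rf = RandomForestClassifier(n_estimators=300, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
```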
Conventional DNN blocks
 TabNet is based on such functionality, and it outperforms DTs while reaping their benefits by careful design (a minimal sketch follows this slide) which
(i) uses sparse instance-wise feature selection learned from data
(ii) constructs a sequential multi-step architecture, where each step contributes to a
portion of the decision based on the selected features
(iii) improves the learning capacity via nonlinear processing of the selected features
(iv) mimics ensembling via higher dimensions and more steps.
TabNet
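The design principles (i)-(iv) can be caricatured in NumPy as below: each decision step softly selects features with an instance-wise mask, processes the selection nonlinearly, and adds its ReLU-gated output to the running decision. The random masks, the softmax, and the tanh layer are stand-ins for the learned attentive and feature transformers, not TabNet's actual components.

```python
import numpy as np

rng = np.random.default_rng(0)
B, D, n_steps, n_d = 4, 6, 3, 8           # batch size, features, decision steps, decision width

f = rng.normal(size=(B, D))               # raw (already normalized) tabular features
W = [rng.normal(size=(D, n_d)) for _ in range(n_steps)]   # per-step weights (stand-ins)

decision_out = np.zeros((B, n_d))
for i in range(n_steps):
    # (i) sparse, instance-wise feature selection: every sample gets its own mask over columns
    logits = rng.normal(size=(B, D))
    M = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # softmax stand-in for sparsemax
    # (iii) nonlinear processing of the selected features
    d = np.tanh((M * f) @ W[i])
    # (ii)/(iv) each step contributes part of the decision; summing over steps acts like an ensemble
    decision_out += np.maximum(d, 0.0)

print(decision_out.shape)                 # (4, 8): fed to a final linear layer for the prediction
```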
[TabNet encoder diagram: at decision step i, the attentive transformer takes the processed features a[i-1] and the prior scale P[i-1] and produces the mask M[i]; the masked features M[i] · f go through the feature transformer, which outputs the decision contribution d[i] and the next step's a[i]. η denotes the per-step decision contribution used later for aggregate feature importance.]
BN : Ghost Batch Normalization
GLU : Gated Linear Unit
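One plausible reading of the FC -> BN -> GLU unit inside the feature transformer, with ghost (virtual-batch) normalization, is sketched below in PyTorch; the module names, virtual batch size, and momentum are assumptions for illustration rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GhostBatchNorm1d(nn.Module):
    """BatchNorm computed over small 'virtual' batches (hypothetical helper)."""
    def __init__(self, num_features, virtual_batch_size=128, momentum=0.01):
        super().__init__()
        self.vbs = virtual_batch_size
        self.bn = nn.BatchNorm1d(num_features, momentum=momentum)

    def forward(self, x):
        # Split the batch into chunks of at most `vbs` rows and normalize each chunk.
        chunks = x.chunk(max(1, x.size(0) // self.vbs), dim=0)
        return torch.cat([self.bn(c) for c in chunks], dim=0)

class GLUBlock(nn.Module):
    """FC -> GhostBN -> GLU: the FC layer doubles the width, GLU halves it back."""
    def __init__(self, in_dim, out_dim, virtual_batch_size=128):
        super().__init__()
        self.fc = nn.Linear(in_dim, 2 * out_dim, bias=False)
        self.bn = GhostBatchNorm1d(2 * out_dim, virtual_batch_size)

    def forward(self, x):
        return F.glu(self.bn(self.fc(x)), dim=-1)   # a * sigmoid(b)

x = torch.randn(256, 32)
print(GLUBlock(32, 16)(x).shape)   # torch.Size([256, 16])
```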
Sparsemax
The idea is to set the probabilities of the smallest values of z to zero and keep only the probabilities of the highest values of z, while keeping the function differentiable so that backpropagation can still be applied (a minimal implementation is sketched after this slide).
Source : https://towardsdatascience.com/what-is-sparsemax-f84c136624e4
Using the processed features $a[i-1]$ from the preceding step, the mask is obtained as $M[i] = \mathrm{sparsemax}(P[i-1] \cdot h_i(a[i-1]))$. Note that $\sum_{j=1}^{D} M[i]_{b,j} = 1$.
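Sparsemax itself is easy to implement; below is a minimal NumPy version for a single 1-D score vector, following the projection-onto-the-simplex formulation. The function name and the example scores are ours.

```python
import numpy as np

def sparsemax(z):
    """Project a 1-D score vector z onto the probability simplex (sparse output)."""
    z_sorted = np.sort(z)[::-1]                   # scores in decreasing order
    k = np.arange(1, z.size + 1)
    z_cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > z_cumsum         # which entries stay nonzero
    k_z = k[support][-1]                          # size of the support
    tau = (z_cumsum[support][-1] - 1.0) / k_z     # threshold subtracted from every score
    return np.maximum(z - tau, 0.0)

p = sparsemax(np.array([1.2, 0.8, 0.1]))
print(p, p.sum())   # [0.7 0.3 0. ] 1.0 -- sums to 1, smallest score zeroed out
```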
 We pass the same $D$-dimensional features $f \in \mathbb{R}^{B \times D}$ to each decision step, where $B$ is the batch size.
 $M[i] \in \mathbb{R}^{B \times D}$ is the mask for soft selection of the salient features, applied as $M[i] \cdot f$.
 $P[i]$ is the prior scale term, denoting how much a particular feature has been used previously:
$P[i] = \prod_{j=1}^{i} (\gamma - M[j])$
 $\gamma$ : relaxation parameter.
 When $\gamma = 1$, a feature is enforced to be used at only one decision step; as $\gamma$ increases, more flexibility is provided to use a feature at multiple decision steps.
 $P[0]$ is initialized as all ones, $\mathbf{1}^{B \times D}$, without any prior on the masked features.
Sparsity regularizer on the masks (a NumPy sketch of the prior-scale update and this regularizer follows this slide):
$L_{\mathrm{sparse}} = \sum_{i=1}^{N_{\mathrm{steps}}} \sum_{b=1}^{B} \sum_{j=1}^{D} \dfrac{-M_{b,j}[i] \log\left(M_{b,j}[i] + \epsilon\right)}{N_{\mathrm{steps}} \cdot B}$
Feature Selection
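The prior-scale update and the sparsity regularizer above translate directly into NumPy; in this sketch, Dirichlet-sampled masks stand in for the attentive transformer's sparsemax output, and the gamma and epsilon values are example settings.

```python
import numpy as np

rng = np.random.default_rng(0)
B, D, n_steps = 4, 6, 3
gamma, eps = 1.5, 1e-8                      # relaxation parameter, numerical floor

P = np.ones((B, D))                         # P[0] = 1^{B x D}: no prior on any feature
L_sparse = 0.0
for i in range(n_steps):
    # Stand-in for M[i] = sparsemax(P[i-1] * h_i(a[i-1])): each row sums to 1.
    M = rng.dirichlet(np.ones(D), size=B)
    # Entropy-like regularizer pushing each row of M[i] toward a sparse selection.
    L_sparse += np.sum(-M * np.log(M + eps)) / (n_steps * B)
    # Prior-scale update: heavily used features are down-weighted at later steps.
    P = P * (gamma - M)

print(round(L_sparse, 3), P.shape)
```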
 If $M_{b,j}[i] = 0$, then the $j^{th}$ feature of the $b^{th}$ sample should have no contribution to the decision.
 If $f_i$ were a linear function, the coefficient $M_{b,j}[i]$ would correspond to the feature importance of $f_{b,j}$.
 We simply propose $\eta_b[i] = \sum_{c=1}^{N_d} \mathrm{ReLU}(d_{b,c}[i])$ to denote the aggregate decision contribution at the $i^{th}$ decision step for the $b^{th}$ sample.
 If $d_{b,c}[i] < 0$, then all features at the $i^{th}$ decision step should have 0 contribution to the overall decision.
 We propose the aggregate feature importance mask (a NumPy sketch follows this slide):
$M_{\mathrm{agg}\text{-}b,j} = \sum_{i=1}^{N_{\mathrm{steps}}} \eta_b[i]\, M_{b,j}[i] \Big/ \sum_{j=1}^{D} \sum_{i=1}^{N_{\mathrm{steps}}} \eta_b[i]\, M_{b,j}[i]$
Interpretability
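Combining the two quantities, the aggregate importance mask can be computed as below; the per-step masks and decision outputs here are random stand-ins for the learned ones.

```python
import numpy as np

rng = np.random.default_rng(1)
B, D, n_d, n_steps = 4, 6, 8, 3

# Stand-ins for the learned per-step masks M[i] (rows sum to 1) and decision outputs d[i].
masks = [rng.dirichlet(np.ones(D), size=B) for _ in range(n_steps)]
d_out = [rng.normal(size=(B, n_d)) for _ in range(n_steps)]

# eta_b[i] = sum_c ReLU(d_{b,c}[i]): how much step i contributes for each sample.
eta = [np.maximum(d, 0.0).sum(axis=1) for d in d_out]          # list of (B,) vectors

# Numerator: sum_i eta_b[i] * M_{b,j}[i]; denominator normalizes over the features j.
num = sum(eta[i][:, None] * masks[i] for i in range(n_steps))  # (B, D)
M_agg = num / num.sum(axis=1, keepdims=True)

print(M_agg.sum(axis=1))   # each row sums to 1: a per-sample feature-importance profile
```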
Tabular self-supervised learning
Propose the task of predicting missing feature columns from the others.
binary mask : $S \in \{0, 1\}^{B \times D}$
encoder inputs : $(1 - S) \cdot f$ -> decoder outputs : $S \cdot \hat{f}$, the reconstructed features
$P[0] = (1 - S)$
reconstruction loss (a NumPy sketch follows this slide):
$\sum_{b=1}^{B} \sum_{j=1}^{D} \left| \dfrac{(\hat{f}_{b,j} - f_{b,j}) \cdot S_{b,j}}{\sqrt{\sum_{b=1}^{B} \left( f_{b,j} - \frac{1}{B} \sum_{b=1}^{B} f_{b,j} \right)^{2}}} \right|^{2}$
Each feature is normalized by its population scale, since different features can have very different ranges.
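Below is a NumPy version of the normalized reconstruction objective above, with a random binary mask S and a noisy stand-in (f_hat) for the decoder output.

```python
import numpy as np

rng = np.random.default_rng(2)
B, D = 8, 5

f = rng.normal(loc=3.0, scale=2.0, size=(B, D))        # ground-truth features
S = (rng.random((B, D)) < 0.3).astype(float)            # 1 = feature masked out / to be predicted
f_hat = f + rng.normal(scale=0.5, size=(B, D))          # stand-in for the decoder reconstruction

# Per-feature scale term sqrt(sum_b (f_{b,j} - mean_j)^2), normalizing features with different ranges.
scale = np.sqrt(((f - f.mean(axis=0)) ** 2).sum(axis=0))

# Only the masked-out entries (S_{b,j} = 1) contribute to the loss.
loss = np.sum(((f_hat - f) * S / scale) ** 2)
print(round(loss, 4))
```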
Instance-wise feature selection (AUC)
The datasets are constructed in such a way that only a subset of the features determines the output.
Performance on real-world datasets
TabNet-S, -M, -L (model size variants)
Feature importance
T-SNE & Training curves